Modeling and Optimization in Science and Technologies

Gautam B. Singh

Fundamentals of Bioinformatics and Computational Biology
Methods and Exercises in MATLAB

Modeling and Optimization in Science and Technologies
Volume 6

Series editors
Srikanta Patnaik, SOA University, Orissa, India
Ishwar K. Sethi, Oakland University, Rochester, USA
Xiaolong Li, Indiana State University, Terre Haute, USA

Editorial Board
Li Cheng, The Hong Kong Polytechnic University, Hong Kong
Jeng-Haur Horng, National Formosa University, Yulin, Taiwan
Pedro U. Lima, Institute for Systems and Robotics, Lisbon, Portugal
Mun-Kew Leong, Institute of Systems Science, National University of Singapore
Muhammad Nur, Diponegoro University, Semarang, Indonesia
Luca Oneto, University of Genoa, Italy
Kay Chen Tan, National University of Singapore, Singapore
Sarma Yadavalli, University of Pretoria, South Africa
Yeon-Mo Yang, Kumoh National Institute of Technology, Gumi, South Korea
Liangchi Zhang, The University of New South Wales, Australia
Baojiang Zhong, Soochow University, Suzhou, China
Ahmed Zobaa, Brunel University, Uxbridge, Middlesex, UK

About this Series
The book series Modeling and Optimization in Science and Technologies (MOST) publishes basic principles as well as novel theories and methods in the fast-evolving field of modeling and optimization. Topics of interest include, but are not limited to: methods for analysis, design and control of complex systems, networks and machines; methods for analysis, visualization and management of large data sets; use of supercomputers for modeling complex systems; digital signal processing; molecular modeling; and tools and software solutions for different scientific and technological purposes.
Special emphasis is given to publications discussing novel theories and practical solutions that, by overcoming the limitations of traditional methods, may successfully address modern scientific challenges, thus promoting scientific and technological progress. The series publishes monographs, contributed volumes and conference proceedings, as well as advanced textbooks. The main targets of the series are graduate students, researchers and professionals working at the forefront of their fields.

More information about this series at http://www.springer.com/series/10577

Gautam B. Singh
Department of Computer Science and Engineering
Oakland University
Rochester, Michigan
USA

ISSN 2196-7326
ISSN 2196-7334 (electronic)
ISBN 978-3-319-11402-6
ISBN 978-3-319-11403-3 (eBook)
DOI 10.1007/978-3-319-11403-3
Library of Congress Control Number: 2014949497
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To my family

Preface

The integration of computers in life sciences has been growing for the last two decades. While the first release of GenBank contained a mere half a million DNA sequence bases in 1982, the current release of GenBank has exceeded 100 gigabases of data. With data come computational challenges for analysis, interpretation, visualization and integration of information. That, in a nutshell, is the reason to familiarize undergraduate students in computer science and engineering with the nature and use of biological data, so that they are prepared to meet the demands of high-tech careers in the twenty-first century.

This textbook is intended for students in computer science, engineering and information technology at the undergraduate or lower graduate level. The material is presented primarily in a simplified manner, and extensive details are left out. However, pointers to appropriate references should guide those who are interested in exploring specific topics in greater detail.

Topics in this textbook are organized into four parts. Part I provides some background to the field of bioinformatics and an introduction to molecular biology and genetics. A survey of biological databases is also included.
The material in this part is considered to be fairly fundamental and should be covered in all courses, graduate and undergraduate.

Part II covers methodologies for retrieving information from biological databases: simple Boolean searches, sequence alignment algorithms, protein alignment, scoring matrices, alignment tools and bio-linguistic methods. Undergraduate students should cover the basic retrieval techniques; advanced topics such as PAM and BLOSUM may be included depending on the amount of time available and the level of preparation of the students.

Part III covers topics related to sequence analysis, including algorithms for finding patterns and detecting genes.

Part IV focuses on topics in phylogenetics and systems biology and covers the algorithms for distance, character and probabilistic methods for inferring phylogeny. Also described are some key algorithms for analyzing micro-chip data.

The book is an offshoot of our project aimed at creating bioinformatics educational resources for undergraduates in computer science and engineering. This project is sponsored by the National Science Foundation, USA. Additional details for the project and bioinformatics educational resources are available from http://bioflow.secs.oakland.edu.

The author would like to acknowledge the efforts of the students who participated in creating resources for the NSF-sponsored bioinformatics project: Kenneth DeMonn, Nirmala Venkatraman, David Poe, Guy Lima, Kellie McGowan and Ashwin Kottam. Their work influenced the content and the presentation in this text.

For the BioFlow project we are also analyzing student learning styles in collaboration with Professor Christine Hansen, Department of Psychology, Oakland University. The results obtained from the student assessment studies were very valuable in providing insight into methods that made this text more comprehensible for computer science and engineering undergraduates.

Rochester, MI, USA
June 2014

Gautam B. Singh

Contents

Part I: Background

1 Introduction to Bioinformatics
1.1 What Is Bioinformatics?
1.2 The Human Genome Project
1.3 Genome Data Statistics
1.4 Applications of Bioinformatics

2 Introduction to Molecular Biology
2.1 Cell Structure
2.1.1 Genome
2.1.2 DNA: Deoxyribonucleic Acid
2.1.3 Genes
2.2 Central Dogma
2.2.1 Replication
2.2.2 Transcription
2.2.3 Translation
2.3 Gene Expression
2.4 Gene Linkage
2.5 DNA Sequencing
2.6 Summary
2.7 Exercises

3 Biological Databases
3.1 Nucleotide Databases
3.1.1 GENBANK
3.2 Protein Sequence Databases
3.2.1 Swiss-Prot
3.2.2 PIR
3.2.3 GenPept
3.2.4 UniProt Knowledgebase
3.3 Biological Patterns Databases
3.3.1 PROSITE
3.3.2 TRANSFAC: Transcription Factors and Regulation
3.4 Genome Viewer
3.5 Gene Ontology Database
3.5.1 GO Terms
3.5.2 Associations
3.5.3 MATLAB Interface to GO
3.5.4 Example
3.6 Other Databases
3.6.1 RefSeq: NCBI Reference Sequences
3.6.2 ESTs and UniGene
3.6.3 Structure Databases
3.7 Summary
3.8 Exercises

4 Processing Biological Sequences with MATLAB
4.1 Sequence Acquisition
4.2 Operations on Nucleotide Sequences
4.3 Joining Exons
4.4 An Example
4.4.1 Download Sequence
4.4.2 Read That Downloaded File
4.4.3 Process Sequence
4.4.4 Extracting Stop Codons
4.4.5 Charting Results
4.5 Restriction Site Detection
4.6 Exercises

Part II: Information Retrieval from Biological Databases

5 Sequence Homology
5.1 Information Retrieval from Biological Databases
5.1.1 Entrez
5.1.2 Search Example
5.1.3 Obtaining Sequences Using MATLAB
5.1.4 Benchmarks
5.2 Dot Plots
5.3 Sequence Alignment
5.3.1 Edit Distance
5.4 Dynamic Programming Algorithm
5.4.1 Distance-Based Alignment
5.4.2 Similarity-Based Alignment