Lecture Notes in Bioinformatics 2666 EditedbyS.Istrail,P.Pevzner,andM.Waterman EditorialBoard: A.Apostolico S.Brunak M.Gelfand T.Lengauer S.Miyano G.Myers M.-F.Sagot D.Sankoff R.Shamir T.Speed M.Vingron W.Wong Subseries of Lecture Notes in Computer Science 3 Berlin Heidelberg NewYork HongKong London Milan Paris Tokyo Concettina Guerra Sorin Istrail (Eds.) Mathematical Methods for Protein Structure Analysis and Design C.I.M.E. Summer School Martina Franca, Italy, July 9-15, 2000 Advanced Lectures 1 3 SeriesEditors SorinIstrail,CeleraGenomics,AppliedBiosystems,Rockville,MD,USA PavelPevzner,UniversityofCalifornia,SanDiego,CA,USA MichaelWaterman,UniversityofSouthernCalifornia,LosAngeles,CA,USA VolumeEditors ConcettinaGuerra Universita`degliStudidiPadova DipartimentodiIngegneriadell’Informazione viaGradenigo6a,35131Padova,Italy E-mail:[email protected] SorinIstrail CeleraGenomics,AppliedBiosystems 45WestGudeDrive,Rockville,MD20850,USA E-mail:[email protected] Cataloging-in-PublicationDataappliedfor AcatalogrecordforthisbookisavailablefromtheLibraryofCongress. BibliographicinformationpublishedbyDieDeutscheBibliothek. DieDeutscheBibliothekliststhispublicationintheDeutscheNationalbibliografie; detailedbibliographicdataisavailableintheInternetat<http://dnb.ddb.de>. CRSubjectClassification(1998):J.3,F.2,H.2,G.2,I.3.5,I.4 ISSN0302-9743 ISBN3-540-40104-0Springer-VerlagBerlinHeidelbergNewYork Thisworkissubjecttocopyright.Allrightsarereserved,whetherthewholeorpartofthematerialis concerned,specificallytherightsoftranslation,reprinting,re-useofillustrations,recitation,broadcasting, reproductiononmicrofilmsorinanyotherway,andstorageindatabanks.Duplicationofthispublication orpartsthereofispermittedonlyundertheprovisionsoftheGermanCopyrightLawofSeptember9,1965, initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer-Verlag.Violationsare liableforprosecutionundertheGermanCopyrightLaw. Springer-VerlagBerlinHeidelbergNewYork amemberofBertelsmannSpringerScience+BusinessMediaGmbH http://www.springer.de ©Springer-VerlagBerlinHeidelberg2003 PrintedinGermany Typesetting:Camera-readybyauthor,dataconversionbyBollerMediendesign Printedonacid-freepaper SPIN:10927403 06/3142 543210 Preface The paperscollectedinthis volumereproducecontributionsbyleadingschol- arstoaninternationalschoolandworkshopwhichwasorganizedandheldwith thegoaloftakingasnapshotofadisciplineundertumultuousgrowth.Indeed, the area of protein folding, docking and alignment is developing in response to needs for a mix of heterogeneous expertise spanning biology, chemistry, mathematics, computer science, and statistics, among others. Some of the problems encountered in this area are not only important for thescientificchallengestheypose,butalsoforthe opportunitiesthey disclose intermsofmedicalandindustrialexploitation.Atypicalexampleisofferedby protein-druginteraction(docking),aproblemposingdauntingcomputational problems at the crossroads of geometry, physics and chemistry, and, at the same time, a problem with unimaginable implications for the pharmacopoeia of the future. Theschoolfocusedonproblemsposedbythestudyofthemechanismsbe- hind protein folding, and explored different ways of attacking these problems under objective evaluations of the methods. Together with a relatively small core of consolidated knowledge and tools, important reflections were brought to this effort by studies in a multitude of directions and approaches. It is obviously impossible to predict which, if any, among these techniques will prove completely successful, but it is precisely the implicit dialectic among them that best conveys the current flavor of the field. Such unique diversity and richness inspired the format of the meeting, and also explains the slight departure of the present volume from the typical format in this series: the exposition of the current sediment is complemented here by a selection of qualified specialized contributions. The topics coveredin this volume pinpoint major issues arising in the de- velopment and analysis of models, algorithms and software tools centered on thestructureofproteins,allofwhichplaycrucialrolesinstructuralgenomics andproteomics.Thestudyof3Dconformationsandrelationshipsamongpro- teins is motivated by the belief that the spatial structure, more than the primarysequence,dictatesthefunctionofaprotein.Thelargestrepositoryof VI Preface 3D protein structures is the Protein Data Bank (PDB), currently containing about 17,000 proteins. The PDB has experienced a sustained growth and is expected to continue to grow at an increasing pace in the near future. The available structures are classified into a relatively small number of families andfolds,accordingtotheirthree-dimensionalconformation.While thenum- ber of proteins will continue to grow, it is widely believed that the number of new folds will remain relatively stable. Structural comparisons involving these structures are at the core of docking and the classification of proteins and sub-aggregates, and motif searches in sequence and protein databases, and ultimately they contribute to understanding the mechanics of folding in living organisms. The first three chapters of this volume contain material that was pre- sented at the school. The chapter entitled “Protein Structure Comparison: Algorithms and Applications,” by G. Lancia and S. Istrail, focuses on the algorithmic aspects and applications of the problem of structure comparison. Structure similarity scoring schemes used in pairwise structure comparison are discussed with respect to the ability to capture the biological relevance of the chemical and physical constraints involved in molecular recognition. Particular attention is paid to the measures based on contact map similarity. The chapter “Spatial PatternDetection in Structural Bioinformatics,” by H.J.Wolfson,discussesthetaskofproteinstructuralcomparisonaswellasthe predictionofprotein-protein,protein-DNAorprotein-druginteraction(dock- ing). Different protein shape representations are used in biological pattern discovery. The paper discusses the shape representations best suited to each computational task, then outlines some rigid and flexible protein structural alignmentalgorithms,anddiscussesthe issuesofrigidboundversusunbound and flexible docking. The chapter “Geometric Methods for Protein Structure Comparison,” by C.FerrariandC.Guerra,discusses,fromatheoreticalpointofview,geometric solutionstotheproblemoffindingcorrespondencesbetweensetsofgeometric features,suchaspointsorsegments.After reviewingexistingmethodsforthe estimation of rigid transformationsunder different metrics, the paper focuses ontheuseofthesecondarystructuresofproteinsforfastretrievalofsimilarity. It also deals with the integrationof strategies using different levels of protein representations, from atomic to secondary structure level. The chapter “Identifying Flat Regions and Slabs in Protein Structures,” byM.E.BockandC.Guerra,presentsgeometricapproachestotheextraction of planar surfaces, which is motivated by the problem of identifying packing regions in proteins. Thetwocontributions,“OPTIMA:aNewScoreFunctionfortheDetection ofRemoteHomologs,”byM.KannandR.A.Goldstein,and“AComparisonof Methods for Assessingthe StructuralSimilarityofProteins,”by D.C. Adams and G.J.P. Naylor, deal with the problem of protein comparison, focusing on different similarity functions for sequence and structure comparison. Preface VII Thenextthreepapers,“PredictionofProteinSecondaryStructureatHigh AccuracyUsingaCombinationofManyNeuralNetworks,”byC.Lundegaard, T.N. Petersen, M. Nielsen, H. Bohr, J. Bohr, S. Brunak, G. Gippert and O. Lund, “Self-consistentKnowledge-BasedApproachto ProteinDesign,” by A. Rossi, C. Micheletti, F. Seno, A. Maritan, and “Learning Effective Amino- Acid Interactions,” by F. Seno, C. Micheletti, A. Maritan and J.R. Banavar, discuss techniques and criteria for protein folding and design. The paper “Protein structure from solid-state NMR,” by J.R. Quine and T.A. Cross,presents a mathematicalanalysisfor solid-statenuclearmagnetic resonance(NMR).Finally,thecontribution“Protein-likePropertiesofSimple Models,”byY.-H.SanejouandandG.Trinquier,focusesonpropertiesrelevant to the sequence-structure relationships. The school was attended by 56 participants from 10 countries. Lectures were given by Prof. Ken Dill, University of California (USA), Prof. Arthur Lesk, University of Cambridge Clinical School (UK), Prof. Michael Levitt, Stanford University School of Medicine (USA), Prof. John Moult, University of Maryland (USA), and Prof. Haim Wolfson, Tel Aviv University (Israel) Invited talks at the workshop were given by Prof. Mary Ellen Bock, Purdue University (USA) and Dr. Andrea Califano, IBM Yorktown (USA). Concettina Guerra Sorin Istrail Contents Protein Structure Comparison: Algorithms and Applications Giuseppe Lancia, Sorin Istrail ..................................... 1 1 Introduction .................................................. 1 2 Preliminaries ................................................. 4 3 Applications of Structure Comparisons........................... 6 4 Software and Algorithms for Structure Comparison ................ 11 5 Problems Based on Contact Map Representations ................. 20 6 Acknowledgements ............................................ 30 References ...................................................... 30 Spatial Pattern Detection in Structural Bionformatics Haim J. Wolfson................................................. 35 1 Introduction .................................................. 35 2 Protein Shape Representation................................... 37 3 Protein Structural Alignment ................................... 38 4 Protein-ProteinDocking........................................ 46 5 Summary..................................................... 52 References ...................................................... 53 Geometric Methods for Protein Structure Comparison Carlo Ferrari, Concettina Guerra .................................. 57 1 Introduction .................................................. 57 2 Protein Description............................................ 59 3 Structural Comparison: Problem Formulation ..................... 62 4 Representation of Rigid Transformations ......................... 63 5 Determination of 3D Rigid Transformations....................... 68 6 Geometric Pattern Matching.................................... 71 7 Indexing Techniques ........................................... 73 8 Graph-Theoretic Approaches.................................... 76 9 Integration of Methods for Protein Comparison Using Different Representations ............................................... 77 X Contents 10Conclusions................................................... 78 11Acknowledgements ............................................ 79 References ...................................................... 79 Identifying Flat Regions and Slabs in Protein Structures Mary Ellen Bock, Concettina Guerra ............................... 83 1 Introduction .................................................. 83 2 A Geometric Algorithm ........................................ 85 3 An Improved Geometric Algorithm .............................. 87 4 Hough Transform.............................................. 88 5 Performances of the Two Algorithms............................. 90 6 Plane Detection in Proteins..................................... 92 7 Acknowledgements ............................................ 95 References ...................................................... 96 OPTIMA: A New Score Function for the Detection of Remote Homologs Maricel Kann, Richard A. Goldstein................................ 99 1 Abstract ..................................................... 99 2 Introduction .................................................. 99 3 Methods .....................................................100 References ......................................................107 A Comparison of Methods for Assessing the Structural Similarity of Proteins Dean C. Adams, Gavin J.P. Naylor ................................109 1 Introduction ..................................................109 2 The DALI Algorithm ..........................................109 3 The Root Mean Square Algorithm...............................110 4 Geometric Morphometrics ......................................111 5 Comparison of Methods ........................................111 6 Discussion....................................................113 References ......................................................114 Prediction of Protein Secondary Structure at High Accuracy Using a Combination of Many Neural Networks Claus Lundegaard, Thomas Nordahl Petersen, Morten Nielsen, Henrik Bohr, Jacob Bohr, Søren Brunak, Garry Gippert, Ole Lund.....117 1 Summary.....................................................117 2 Introduction ..................................................117 3 Methods .....................................................118 4 Results.......................................................119 References ......................................................121