Modeling the Free Energy Landscape of Biomolecules via Dihedral Angle Principal Component Analysis of Molecular Dynamics Simulations

Dissertation for the attainment of the doctoral degree in natural sciences, submitted to the Department of Biochemistry, Chemistry and Pharmacy of Goethe-Universität in Frankfurt am Main

by Alexandros Altis from Frankfurt am Main

Frankfurt am Main 2008 (D 30)

Accepted as a dissertation by the Department of Biochemistry, Chemistry and Pharmacy of Goethe-Universität Frankfurt am Main.

Dean: Prof. Dr. Dieter Steinhilber
First reviewer: Prof. Dr. Gerhard Stock
Second reviewer: JProf. Dr. Karin Hauser
Date of the disputation: ......

Contents

1 Introduction
2 Dihedral Angle Principal Component Analysis
  2.1 Introduction to molecular dynamics simulation
  2.2 Definition and derivation of principal components
  2.3 Circular statistics
  2.4 Dihedral angle principal component analysis (dPCA)
  2.5 A simple example - trialanine
  2.6 Interpretation of eigenvectors
  2.7 Complex dPCA
  2.8 Energy landscape of decaalanine
  2.9 Cartesian PCA
  2.10 Direct angular PCA
  2.11 Correlation analysis
  2.12 Nonlinear principal component analysis
  2.13 Conclusions
3 Free Energy Landscape
  3.1 Introduction
  3.2 Clustering
  3.3 Dimensionality of the free energy landscape
  3.4 Geometric and kinetic clustering
  3.5 Markovian modeling
  3.6 Visualization of the free energy landscape
  3.7 Conclusions
4 Dynamics Simulations
  4.1 Dynamical systems and time series analysis
  4.2 How complex is peptide folding?
  4.3 Multidimensional Langevin modeling
  4.4 Conclusions
5 Applications to larger systems - an outlook
  5.1 Free energy landscapes for the villin system
  5.2 Langevin dynamics for the villin system
  5.3 Outlook
6 Appendix
  6.1 Transformation of probability densities
  6.2 Complex dPCA vs. dPCA
  6.3 Integrating out Gaussian-distributed degrees of freedom
  6.4 Molecular dynamics simulation details
  6.5 Source code in R
References
Acknowledgments
German Summary
Curriculum Vitae
Publications

Chapter 1 Introduction

Proteins can be regarded as the most important building blocks of our body.
They function as mechanical tools, perform transport (e.g., hemoglobin) and communication, catalyze biochemical reactions, and are involved in many other essential processes of life. The native structure to which a protein folds determines its biological function. To answer the protein folding problem, that is, how the amino acid sequence of a protein as synthesized by ribosomes dictates its structure, one has to understand the complex dynamics of protein folding. In the folding process, transitions between metastable conformational states play a crucial role. These states are long-lived intermediates, which for proteins can have lifetimes of up to microseconds before undergoing further transitions.

Experiments using nuclear magnetic resonance (NMR) spectroscopy or X-ray crystallography can provide structural information on the native state or sometimes on metastable states [1]. But as a system quickly relaxes to a lower energy state, the dynamics of the folding process is hard to assess by experiment. In addition, traditional experiments provide only average quantities such as mean structures, not distributions and variations. Molecular dynamics computer simulations are used to obtain a deeper understanding of the dynamics and mechanisms involved in protein folding [2].

Molecular dynamics simulations have become a popular and powerful approach to describe the structure, dynamics, and function of biomolecules in atomic detail. In the past few years, computer power has increased such that simulations of small peptides on the timescale of microseconds are now feasible. With the help of worldwide distributed computing projects such as Folding@home [3], even folding simulations of small proteins that fold on microsecond and submicrosecond timescales are possible [4].
Markov chain models constructed from molecular dynamics trajectories could prove promising for modeling the correct statistical conformational dynamics over much longer times than the molecular dynamics simulations used as input [5–7]. Unfortunately, it is neither trivial to define the discrete states for a Markov approach, nor is it clear whether the system under consideration obeys the Markov property.

As molecular dynamics simulations result in huge data sets which need to be analyzed, one needs methods which filter out the essential information. For example, biomolecular processes such as molecular recognition, folding, and aggregation can all be described in terms of the molecule's free energy [8–10]

\[ \Delta G(r) = -k_{\mathrm{B}} T \left[ \ln P(r) - \ln P_{\max} \right]. \tag{1.1} \]

Here $P$ is the probability distribution of the molecular system along some (in general multidimensional) coordinate $r$, and $P_{\max}$ denotes its maximum, which is subtracted to ensure that $\Delta G = 0$ for the lowest free energy minimum. Popular choices for the coordinate $r$ include the fraction of native contacts, the radius of gyration, and the root mean square deviation of the molecule with respect to the native state. The probability distribution along these "order parameters" may be obtained from experiment, from a theoretical model, or from a computer simulation. The resulting free energy "landscape" has promoted much of the recent progress in understanding protein folding [8–12]. Since this landscape is a very high-dimensional and intricate object with many free energy minima, finding good order parameters is essential for extracting useful low-dimensional models of the conformational dynamics of peptides and proteins. For the decomposition of a system into a relevant (low-dimensional) part and an irrelevant part, principal component analysis has become a crucial tool [13].
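As a minimal sketch of Eq. (1.1), assuming a one-dimensional coordinate $r$ and a simple histogram estimate of $P(r)$, the free energy profile could be computed as follows. The function name `free_energy_profile`, the bin count, the choice of units with $k_{\mathrm{B}}T = 1$, and the synthetic two-state data are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def free_energy_profile(samples, bins=50, kT=1.0):
    """Estimate Delta G(r) = -kT [ln P(r) - ln P_max] from 1D samples of r.

    kT is the thermal energy k_B * T in the desired energy unit (assumed 1.0 here).
    """
    # Histogram estimate of the probability density P(r)
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Bins that were never visited get an infinite free energy
    delta_g = np.full(hist.shape, np.inf)
    visited = hist > 0
    delta_g[visited] = -kT * (np.log(hist[visited]) - np.log(hist.max()))
    return centers, delta_g

# Hypothetical two-state coordinate: a 70/30 mixture of two Gaussians
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-1.0, 0.3, 7000),
                          rng.normal(1.0, 0.3, 3000)])

centers, delta_g = free_energy_profile(samples)
print(centers[np.argmin(delta_g)], delta_g.min())
```

By construction the most populated bin has $\Delta G = 0$ and all other bins have non-negative free energy. In more than one dimension the same idea applies with a multidimensional histogram, although the number of samples needed per bin grows quickly with dimension.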
Principal component analysis (PCA), also called quasiharmonic analysis or the essential dynamics method [14–17], is one of the most popular methods to systematically reduce the dimensionality of a complex system. The approach is based on the covariance matrix, which provides information on the two-point correlations of the system. The PCA represents a linear transformation that diagonalizes the covariance matrix and thus removes the instantaneous linear correlations among the variables. When the eigenvalues of the transformation are ordered by decreasing magnitude, it has been shown that a large part of the system's fluctuations can be described in terms of only a few principal components, which may serve as reaction coordinates [14–20] for the free energy landscape.

Some PCA methods using internal (instead of Cartesian) coordinates [21–27] have been proposed in the literature. In biomolecules, the consideration of dihedral angles appears particularly appealing, because other internal coordinates such as bond lengths and bond angles usually do not undergo changes of large amplitude. Due to the circularity of the angular variables, however, it is nontrivial to apply methods such as PCA to the analysis of molecular dynamics simulations.

This work presents a contribution to the literature on methods in search of low-dimensional models that yield insight into the equilibrium and kinetic behavior of peptides and small proteins. We develop a detailed understanding of various methods for projecting the sampled configurations of molecular dynamics simulations to obtain a low-dimensional free energy landscape. Furthermore, low-dimensional dynamic models for the conformational dynamics of biomolecules are presented. As exemplary systems, mainly short alanine chains are studied. Due to their small size, they allow for long simulations. They are simple yet nontrivial systems, since their flexibility lets them interconvert rapidly between conformers.
Understanding these polypeptide chains in great detail is of considerable interest for gaining insight into the process of protein folding. For example, K. Dill et al. conclude in their review [28] of the protein folding problem that "the once intractable Levinthal puzzle now seems to have a very simple answer: a protein can fold quickly and solve its large global optimization puzzle simply through piecewise solutions of smaller component puzzles".

The thesis is organized as follows: Chapter 2 provides the theoretical foundations of the dihedral angle principal component analysis (dPCA) for the analysis of the dynamics of the φ,ψ backbone dihedral angles. In an introduction to circular statistics we thoroughly discuss the implications of the proposed sin/cos transformation of the dihedral angles, which doubles the number of variables from N angular variables to 2N Cartesian-like ones. It is shown that this transformation can indeed faithfully represent the original angle distribution without generating spurious results. Furthermore, we show that the dPCA components can readily be characterized by the conformational changes of the peptide. For the trialanine system the equivalence between a Cartesian PCA and the dPCA is demonstrated. We then introduce a complex-valued version of the dPCA which sheds some light on the doubling of variables occurring in the sin/cos dPCA. The developed concepts are demonstrated and applied to a 300 ns molecular dynamics simulation of the decaalanine peptide.

What follows is a detailed study of the similarities and differences of various PCA methods. The dPCA is evaluated in comparison to alternative projection approaches. In particular, it is shown that Cartesian PCA fails to reveal the true structure of the free energy landscape of small peptides, except for the conformationally trivial example trialanine.
The smooth appearance of the Cartesian PCA landscape is an artifact of the mixing of internal and overall motion. This is demonstrated using a 100 ns simulation of pentaalanine and an 800 ns simulation of heptaalanine. In addition, the dPCA is compared to a PCA which operates directly on the dihedral angles, thus avoiding a doubling of variables. Various drawbacks of such a method, which does not properly take the circularity of the variables into account, are discussed. The dPCA is also compared to a version using the correlation matrix instead of the covariance matrix. Finally, it is concluded that, for the cases studied, the dPCA provides the most detailed low-dimensional representation of the free energy landscape. The chapter ends with a correlation analysis for the dihedral angles of heptaalanine, which is compared to results from the literature, and some remarks about nonlinear PCAs.

Based on the dPCA, Chapter 3 presents a systematic approach to construct a low-dimensional free energy landscape from a classical molecular dynamics simulation. After demonstrating that a representation of the free energy landscape in too few dimensions can lead to serious artifacts and oversimplifications of this intricate surface, we attempt to answer the question of how many dimensions or PCs need to be taken into account in order to appropriately describe a given biomolecular process. It is shown that this dimensionality can be determined from the distribution and the autocorrelation of the PCs. Applying geometric and kinetic clustering techniques to an 800 ns simulation of heptaalanine, it is shown that a five-dimensional dPCA energy landscape is appropriate for reproducing the correct number, energy, and location of the system's metastable states and barriers. After presenting several ways to visualize the free energy landscape using transition networks and a disconnectivity graph, we close the chapter with conclusions.
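The two ingredients summarized above for Chapter 2, the sin/cos transformation of N dihedral angles into 2N Cartesian-like variables and the diagonalization of their covariance matrix, can be illustrated with a short sketch. This is not the R code of the thesis appendix: the function name `dpca`, the column ordering of the transformed variables, and the synthetic two-state angle data are assumptions made purely for illustration.

```python
import numpy as np

def dpca(angles):
    """Sketch of a sin/cos dihedral angle PCA.

    angles: array of shape (n_frames, n_angles), in radians.
    Returns eigenvalues (descending), eigenvectors (as columns),
    and the projections of all frames onto the eigenvectors.
    """
    # Map each angle onto the unit circle: N angles -> 2N variables.
    # Ordering (all cosines, then all sines) is an arbitrary choice here.
    q = np.hstack([np.cos(angles), np.sin(angles)])
    q_centered = q - q.mean(axis=0)

    # Covariance matrix of the transformed variables and its eigendecomposition
    cov = np.cov(q_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Principal components: projections onto the eigenvectors
    pcs = q_centered @ eigvecs
    return eigvals, eigvecs, pcs

# Hypothetical data: 4 dihedral angles hopping between two conformational states
rng = np.random.default_rng(1)
n_frames = 5000
state = rng.integers(0, 2, size=n_frames)
state_centers = np.array([[-1.0, 2.5, -1.0, 2.5],
                          [-2.0, -0.8, -2.0, -0.8]])
angles = state_centers[state] + rng.normal(0.0, 0.2, size=(n_frames, 4))
angles = (angles + np.pi) % (2 * np.pi) - np.pi  # wrap into [-pi, pi)

eigvals, eigvecs, pcs = dpca(angles)
explained = eigvals / eigvals.sum()
print(explained[:3])
```

Because the synthetic trajectory contains a single two-state transition, the first component carries most of the variance. The projections are linearly uncorrelated by construction, which is the sense in which PCA removes the instantaneous linear correlations among the variables.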
Having constructed low-dimensional free energy landscapes, the remaining aim is to construct dynamic models in this reduced dimensionality. Chapter 4 is concerned with the construction of low-dimensional models for peptide and protein dynamics from the point of view of modern nonlinear dynamics. Using methods from nonlinear time series analysis, a deterministic model of the dynamics is developed and applied to molecular dynamics simulations of short alanine polypeptide chains. The well-established concept of the complexity of a dynamical system is applied to folding trajectories. Interestingly, while the dimension of the free energy landscape increases with system size, the Kaplan-Yorke dimension may decrease. This suggests that the molecular dynamics generates less and less chaotic orbits as the length of the peptide chains increases. Furthermore, we introduce a mixed deterministic-stochastic model for the conformational dynamics in reduced dimensions, which is based on the estimation of the drift and diffusion vector fields of a Langevin equation. This makes it possible, for example, to study nonequilibrium dynamics such as the relaxation to the folded state of a protein.

Finally, in Chapter 5 we apply some of the developed techniques to a larger system, namely a variant of the villin headpiece subdomain (HP-35 NleNle). Using many hundreds of molecular dynamics trajectories obtained from Folding@home, we analyze the resulting free energy landscape for this system. In a next step we attempt to find a good dynamic model using the Langevin ansatz described in the preceding chapter. We finally estimate folding times for this system and conclude with an outlook. Conclusions are drawn at the end of each chapter.