Investigation into the role of sequence-driven-features and amino acid indices for the prediction of structural classes of proteins

By Mr. Sundeep Singh Nanuwa

Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of De Montfort University

April 2013

Abstract

The work undertaken within this thesis is towards the development of a representative set of sequence driven features for the prediction of structural classes of proteins. Proteins are biological molecules that make living things function. To determine the function of a protein, its structure must be known, because the structure dictates its physical capabilities. A protein is generally classified into one of four main structural classes, namely all-α, all-β, α + β or α / β, which are based on the arrangement and gross content of the secondary structure elements. Current methods assign structural classes to proteins by manual inspection, which is a slow process. To address this problem, this thesis is concerned with the automated prediction of structural classes of proteins and the extraction of a small but robust set of sequence driven features by using the amino acid indices.

The first main study undertook a comprehensive analysis of the largest collection of sequence driven features, which comprises an existing set of 1479 descriptor values grouped into ten feature groups. The results show that composition based feature groups are the most representative of the four main structural classes, achieving a predictive accuracy of 63.87%. This finding led to the second main study, the development of the generalised amino acid composition (GAAC) method, in which amino acid index values are used to weight the corresponding amino acids. The GAAC method achieves a higher accuracy of 68.02%. The third study refined the amino acid indices database, which resulted in the highest accuracy of 75.52%.
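
The idea of weighting amino acid composition by an amino acid index can be illustrated with a minimal sketch. The exact GAAC formulation and any normalisation of the indices are not given in this abstract, so the element-wise product of composition and index value used here is an assumption. The function names are hypothetical, and the Kyte-Doolittle hydropathy scale is used purely as an example index.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, chosen here only as an example index.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def weighted_composition(sequence, index):
    """Weight each composition value by its index value (assumed form)."""
    composition = amino_acid_composition(sequence)
    return [composition[aa] * index[aa] for aa in AMINO_ACIDS]


# A toy four-residue sequence: A occurs twice, C and D once each.
features = weighted_composition("ACDA", KYTE_DOOLITTLE)
```

Each sequence, whatever its length, maps to a 20-element vector, so one such vector per index could be passed to a classifier. Whether the thesis combines indices in this way is not stated here.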
The main contributions of this thesis are as follows. First, four computationally extracted sequence driven feature sets were developed based on the underused amino acid indices. Two of these methods, GAAC and the hybrid method, improve on traditional sequence driven features, offering smaller, refined feature sets and higher classification accuracy. Second, six non-redundant novel sets of the amino acid indices dataset were developed, each of which is more representative than the original database. Finally, two large datasets at 25% and 40% homology were constructed, consisting of over 5000 and 7000 protein samples, respectively. A public webserver has been developed at http://www.generalised-protein-sequence-features.com, which allows biologists and bioinformaticians to extract GAAC sequence driven features from any input protein sequence.

Keywords: protein structural classes, sequence driven features, amino acid indices, test procedures, largest protein structural class datasets, generalised amino acid composition

Publications

Nanuwa, S. S. and H. Seker (2008). Investigation into the role of sequence-driven-features for prediction of protein structural classes. 8th IEEE International Conference on Bioinformatics and BioEngineering, BIBE 2008, Athens, Greece. Volume 1. Pages 583-586.

Nanuwa, S. S., A. Dziurla and H. Seker (2009). Weighted amino acid composition based on amino acid indices for prediction of protein structural classes. Final Program and Abstract Book - 9th International Conference on Information Technology and Applications in Biomedicine, ITAB 2009, Larnaca, Cyprus. Volume 1. Pages 327-332.

Nanuwa, S. S. and H. Seker (2011). Prediction of a protein’s structural class using amino acid indices. The International Congress on Bioinformatics and Biomics, 2009, Izmir, Turkey.

This thesis is dedicated to my wife Sharan, dad Amarjit, mum Harjinder, brother Jamie & the Nanuwa family.
Acknowledgements

I would like to offer my sincerest gratitude to my supervisor, Dr Huseyin Seker, who has supported me throughout my Masters and PhD. The work was undertaken with his encouragement, which gave me the determination to complete the thesis. One simply could not wish for a better supervisor.

I would also like to offer my sincerest gratitude to my current place of work, the Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory (JDRF/WT DIL), centred at the Cambridge Institute for Medical Research at the University of Cambridge, for allowing me the time needed to complete the thesis.

Table of Contents

Abstract
Publications
Acknowledgements
Acronyms
Mathematical Symbols
1 Chapter 1 - Introduction
  1.1 Background
  1.2 Prediction of structural classes of proteins
  1.3 Sequence-driven-features & Amino acid indices
  1.4 Organisation of PhD thesis
2 Chapter 2 - Literature Review
  2.1 Introduction
  2.2 Cells, DNA, Proteins
    2.2.1 Primary structure
    2.2.2 Secondary structure
    2.2.3 Tertiary structure
    2.2.4 Current experimental procedures to determine protein structures
    2.2.5 Transition from secondary structures to structural classes
    2.2.6 Structural classes
  2.3 Bioinformatics for prediction of structural classes of proteins
    2.3.1 Real world bioinformatics applications
    2.3.2 Bioinformatics and Proteomics
    2.3.3 Bioinformatics for the prediction of structural classes of proteins by using sequence information
    2.3.4 Data resources
    2.3.5 Datasets constructed using PDB and SCOP
    2.3.6 Sequence driven features for protein representation
    2.3.7 Amino acid indices
    2.3.8 Predictive models
    2.3.9 Assessment of the predictive models (test procedures)
    2.3.10 Sequence homology
    2.3.11 Feature selection
    2.3.12 Current prediction accuracies
  2.4 Conclusions
3 Chapter 3 - Materials and Methods
  3.1 Introduction
  3.2 Datasets
  3.3 Dataset filtering
  3.4 Classification algorithms
    3.4.1 K-nearest neighbour classifier
    3.4.2 Support Vector Machine
    3.4.3 Differences between KNN and SVM
    3.4.4 Classification performance
    3.4.5 Test procedures
  3.5 Hierarchical clustering
    3.5.1 Bioinformatics application of hierarchical clustering
  3.6 Principal Component Analysis
    3.6.1 Bioinformatics application of PCA
  3.7 Conclusions
4 Chapter 4 - Analysis of existing sequence driven features
  4.1 Introduction
  4.2 Sequence representation: Sequence-driven features
  4.3 Sequence driven features technical details
    4.3.1 Amino Acid Composition
    4.3.2 Dipeptide Composition
    4.3.3 Autocorrelation feature groups
    4.3.4 Composition, Transition and Distribution
  4.4 Sequence Order
  4.5 Pseudo amino acid composition
  4.6 Results and discussion
    4.6.1 Results for amino acid composition feature group
    4.6.2 Results for dipeptide composition feature group
    4.6.3 Results for autocorrelation feature groups
    4.6.4 Results for composition feature group
    4.6.5 Results for transition and distribution feature group
    4.6.6 Results for pseudo amino acid composition
    4.6.7 Results of test procedures performance
    4.6.8 Individual class performance
  4.7 Conclusions
5 Chapter 5 - Amino acid indices based sequence driven features
  5.1 Introduction
  5.2 Amino acid indices
  5.3 Amino acid indices database
    5.3.1 Normalisation of amino acid indices
  5.4 Novel feature extraction methods based on amino acid indices
    5.4.1 Hybrid computational method for the analysis of amino acid indices – method 1
    5.4.2 Generalised amino acid composition – method 2
    5.4.3 Novel feature extraction methods over sequence representation matrix based on amino acid indices
    5.4.4 Sequence representation matrix
    5.4.5 Feature extraction using the mean of sequence representation matrix – method 3
    5.4.6 Feature extraction using principal component analysis over sequence representation matrix – method 4
  5.5 Results
    5.5.1 Hybrid computational method for the analysis of amino acid indices reveals novel indices – method 1
    5.5.2 Assessment of amino acid indices for GAAC – method 2
    5.5.3 Comparison with the published and benchmark study results
    5.5.4 Individual class performance
    5.5.5 Assessment of performance based on test procedures
    5.5.6 Results obtained using the novel feature extraction methods based on amino acid indices – methods 3 and 4
  5.6 Generalised Amino Acid Composition webserver
  5.7 Conclusions
6 Chapter 6 - Feature selection
  6.1 Introduction
  6.2 Feature selection categories
  6.3 F-select
  6.4 Minimum redundancy maximum relevance feature selection
  6.5 Results from feature selection methods
    6.5.1 Feature selection results over the traditional sequence driven features presented in chapter 4
    6.5.2 Feature selection results based on the sequence driven features presented in chapter 5 – method 3
    6.5.3 Feature selection results based on the sequence driven features presented in chapter 5 – method 4
  6.6 Conclusions
7 Chapter 7 - Discussion, conclusion and future work
  7.1 Introduction
  7.2 Critical evaluation of traditional sequence-driven features
    7.2.1 Composition based sequence-driven-feature groups
    7.2.2 Autocorrelation feature groups
    7.2.3 Composition, transition and distribution feature groups
    7.2.4 Pseudo amino acid composition
  7.3 Critical evaluation of amino acid indices based sequence-driven-features
    7.3.1 Updated amino acid indices dataset
    7.3.2 Generalised amino acid composition
    7.3.3 Identification of a candidate set of amino acid indices
    7.3.4 Generalised amino acid composition webserver
    7.3.5 Amino Acid Indices based sequence-driven-feature extraction methods
    7.3.6 Hybrid sequence driven feature extraction
  7.4 Feature selection
  7.5 Test procedures
  7.6 Assessment of multiple k-nearest neighbour
  7.7 Conclusions
  7.8 Future work
References
Appendix I – Sequence Driven Features
Appendix II Full list of amino acid indices from the AAindex database
Appendix III Full list of amino acid indices found through literature searches
Appendix IV Generated amino acid indices using SINGLE Linkage and Minimum Cluster Distance = 1