NEW ANT COLONY OPTIMISATION ALGORITHMS FOR HIERARCHICAL CLASSIFICATION OF PROTEIN FUNCTIONS

A thesis submitted to The University of Kent at Canterbury in the subject of Computer Science for the degree of Doctor of Philosophy.

By Fernando Esteban Barril Otero
January 2010

Abstract

Ant colony optimisation (ACO) is a metaheuristic for solving optimisation problems, inspired by the foraging behaviour of ant colonies. It has been successfully applied to several types of optimisation problems, such as scheduling and routing, and more recently to the discovery of classification rules. The classification task in data mining aims at predicting the value of a given goal attribute for an example, based on the values of a set of predictor attributes for that example. Since real-world classification problems are generally described by both nominal (categorical or discrete) and continuous (real-valued) attributes, classification algorithms are required to cope with both types of attribute. Current ACO classification algorithms have been designed with the limitation of discovering rules using only nominal attributes describing the data. Furthermore, they also have the limitation of not coping with more complex types of classification problems—e.g., hierarchical multi-label classification problems.

This thesis investigates the extension of ACO classification algorithms to overcome the aforementioned limitations. Firstly, a method is proposed to extend the rule construction process of ACO classification algorithms to cope with continuous attributes directly. Four new ACO classification algorithms are presented, as well as a comparison between them and well-known classification algorithms from the literature. Secondly, an ACO classification algorithm for the hierarchical problem of protein function prediction—which is a major type of bioinformatics problem addressed in this thesis—is presented. Finally, three different approaches to extend ACO classification algorithms to the more complex case of hierarchical multi-label classification are described, elaborating on the ideas of the proposed hierarchical classification ACO algorithm. These algorithms are compared against state-of-the-art decision tree induction algorithms for hierarchical multi-label classification in the context of protein function prediction.

The computational results of experiments with a wide range of data sets—including challenging protein function prediction data sets with a very large number of class labels—have shown that the proposed ACO classification algorithms are competitive with well-known classification algorithms from the literature, for both conventional (flat single-label) classification and hierarchical multi-label classification problems. To the best of our knowledge, these algorithms address unexplored research areas in the context of ACO classification algorithms, and are therefore original contributions.

Acknowledgements

Firstly, I would like to thank my supervisors Alex Freitas and Colin Johnson for the opportunity of undertaking this work, and for their excellent supervision, support and guidance throughout my PhD. Furthermore, I owe a big thank you to my family, for always being there. A special thanks to my parents for the opportunities and the support they have given me in all my endeavours. I am also very grateful to Carin Tunåker for all her help, patience, love and support.

Many friends made my time in Canterbury highly enjoyable.
I would like to thank Abigail Smith, Beatriz Martinez, Ben Robert, Claire Harris, Cristiano Cantore, Dolores Castaneda, Eva Barbazan, Gisele Pappa, Javier Valbuena, Mariola Esteban and Miguel Leon-Ledesma for all the time that we spent together. My thanks also go to my friends and colleagues from the School of Computing at the University of Kent, especially to Carlos Silla, Lingfang Du, Mudassar Iqbal, Sebastian Marion and Siddhartha Ghosh.

Finally, I would like to gratefully acknowledge the financial support received from the European Union's INTERREG project (Ref. No. 162/025/361), as well as the continuous financial support received from the School of Computing at the University of Kent, during the course of my PhD.

Contents

Abstract  ii
Acknowledgements  iv
List of Tables  ix
List of Figures  xiii
List of Algorithms  xix

1 Introduction  1
  1.1 Original Contributions  5
  1.2 Structure of the Thesis  6
  1.3 Publication List  7

2 Data Mining  9
  2.1 Common Data Mining Tasks  12
    2.1.1 Regression  12
    2.1.2 Clustering  12
    2.1.3 Association Rule Learning  12
  2.2 The Conventional (Flat) Classification Task  13
    2.2.1 Decision Tree Induction  14
    2.2.2 Rule Induction  17
    2.2.3 Multi-Label Classification  21
  2.3 The Hierarchical Classification Task  23
    2.3.1 Basic Concepts of Hierarchical Classification  24
    2.3.2 Hierarchical Multi-Label Classification  35
  2.4 Evaluation Measures for Classification  36
  2.5 Evaluation Measures for Hierarchical Classification  37
    2.5.1 Hierarchical Measures of Precision, Recall and F-measure  38
    2.5.2 Precision-Recall Curves  39
  2.6 Summary  42

3 Ant Colony Optimisation  43
  3.1 The ACO Metaheuristic  44
    3.1.1 Problem Representation  46
    3.1.2 Building Solutions  47
    3.1.3 Pheromone Trails  48
  3.2 ACO applied to Classification: Ant-Miner  49
    3.2.1 Construction Graph  51
    3.2.2 Rule Construction  52
    3.2.3 Heuristic Information  54
    3.2.4 Rule Evaluation  55
    3.2.5 Rule Pruning  55
    3.2.6 Pheromone Trails  56
    3.2.7 Classifying New Examples  58
  3.3 Ant-Miner Extensions  58
  3.4 Summary  62

4 Bioinformatics  64
  4.1 Biological Background  66
    4.1.1 Proteins  66
  4.2 Protein Databases  70
    4.2.1 UniProt Knowledgebase  72
    4.2.2 InterPro  72
    4.2.3 IntAct  73
  4.3 Protein Functional Classification Schemes  73
    4.3.1 Gene Ontology  74
    4.3.2 FunCat  75
  4.4 Protein Function Prediction  78
    4.4.1 Protein Features as Predictor Attributes  81
  4.5 Summary  85

5 Handling Continuous Attributes in Ant Colony Classification Algorithms  86
  5.1 Ant-Miner Coping with Continuous Attributes  88
    5.1.1 Construction Graph  89
    5.1.2 Heuristic Information  89
    5.1.3 Rule Construction  92
    5.1.4 Pheromone Updating  94
  5.2 Minimum Description Length-based Discretisation  94
  5.3 Encoding Attribute Interaction as Pheromone Levels: Associating Pheromones with Edges  98
  5.4 Combining Pheromone Associated with Edges and Minimum Description Length-based Discretisation  103
  5.5 Summary  103

6 Computational Results for Ant-Miner Coping with Continuous Attributes  105
  6.1 Data Sets  106
  6.2 Experimental Setup  108
  6.3 Results  110
  6.4 Summary  119

7 Hierarchical and Multi-Label Ant Colony Classification Algorithms  121
  7.1 Hierarchical Classification Ant-Miner  122
    7.1.1 Construction Graphs  124
    7.1.2 Rule Construction  127
    7.1.3 Rule Evaluation  129
    7.1.4 Rule Pruning  130
    7.1.5 Pheromone Trails  132
    7.1.6 Heuristic Information  133
    7.1.7 Using a Rule List to Classify New Examples  136
  7.2 Coping with Multi-Label Data  136
    7.2.1 Multi-Label Rule Consequent  138
    7.2.2 Distance-based Heuristic Information  140
    7.2.3 Distance-based Discretisation of Continuous Values  142
    7.2.4 Hierarchical Multi-Label Rule Evaluation  146
    7.2.5 Simplified Rule Pruning  146
  7.3 Pittsburgh-based Approach  147
    7.3.1 Extended Sequential Covering Strategy  149
    7.3.2 Updating Pheromone Values Based on the Rule List Quality  153
  7.4 A Baseline Approach for Hierarchical Multi-Label Classification with Ant-Miner: Building One Classifier per Class  153
    7.4.1 The Baseline Ant Colony Algorithm  154
    7.4.2 Class-specific Heuristic Information  157
    7.4.3 Class-specific Interval Selection for Continuous Attributes  158
    7.4.4 Rule Quality Measure  159
    7.4.5 Classifying New Examples  160
  7.5 Summary  161

8 Computational Results for Hierarchical and Multi-Label Ant-Miner  163
  8.1 Initial Work on Protein Function Prediction—the hAnt-Miner algorithm  164
    8.1.1 Data Preparation  164
    8.1.2 Experimental Setup  166
    8.1.3 Results and Discussion  168
  8.2 Hierarchical Multi-Label Protein Function Prediction in Yeast  170
    8.2.1 Data Sets  170
    8.2.2 Experimental Setup  171
    8.2.3 Results and Discussion  173
  8.3 Summary  182

9 Conclusions and Future Research  186
  9.1 Contributions  187
  9.2 Future Research  190

References  193

A Software Availability  207

List of Tables

2.1 The data set for the artificial ‘weather’ problem [97], where each row corresponds to a different ‘saturday morning’ observation and the ‘Class’ indicates whether or not (‘P’ or ‘N’, respectively) it is a good day to play tennis.  10

2.2 An example of a multi-label data set, where each example is associated with one or more different class labels. In this example, the predictor attributes are omitted and the class attribute has three different values {sports, politics, entertainment}.  22

4.1 Summary of biological databases publicly available online, mainly containing protein information.  71

4.2 Main (top-level) FunCat categories. All categories, with the exception of 98 and 99, represent a root category of a tree-structured hierarchy.  77

6.1 Summary of the classification algorithms used in the experiments. Ant-Miner is described in chapter 3; cAnt-Miner variations are described in chapter 5; J48, JRip and PART are implemented in the Weka workbench [133] and briefly described in chapter 2.  106

6.2 Summary of the data sets used in the experiments. The first column of the table gives the data set abbreviation, the second gives the data set name, the third and fourth columns give the number of nominal and continuous attributes respectively, the fifth column gives the number of class labels, the sixth column gives the number of examples in the original data set and the seventh column gives the number of examples in the discrete data set (after the discretisation of continuous attributes and removal of duplicated examples).  107

6.3 Summary of the user-defined parameter values used in Ant-Miner and cAnt-Miner variations in all data sets.  109

6.4 Predictive accuracy (mean ± standard deviation) obtained with the ten-fold cross-validation procedure in the eighteen data sets by Ant-Miner, cAnt-Miner, cAnt-Miner-MDL and cAnt-Miner2, respectively. The last row of the table indicates the rank-based score—the higher the score, the better the ranking—according to the non-parametric Friedman test [34, 54].  111
6.5 Predictive accuracy (mean ± standard deviation) obtained with the ten-fold cross-validation procedure in the eighteen data sets by cAnt-Miner2-MDL, J48, JRip and PART, respectively. The last row of the table indicates the rank-based score—the higher the score, the better the ranking—according to the non-parametric Friedman test [34, 54].  112

6.6 Model size (mean ± standard deviation) obtained with the ten-fold cross-validation procedure in the eighteen data sets by Ant-Miner, cAnt-Miner, cAnt-Miner-MDL and cAnt-Miner2, respectively. The last row of the table indicates the rank-based score—the lower the score, the better the ranking, since smaller models are preferred—according to the non-parametric Friedman test [34, 54].  115

6.7 Model size (mean ± standard deviation) obtained with the ten-fold cross-validation procedure in the eighteen data sets by cAnt-Miner2-MDL, J48, JRip and PART, respectively. The last row of the table indicates the rank-based score—the lower the score, the better the ranking, since smaller models are preferred—according to the non-parametric Friedman test [34, 54].  116

6.8 Summary of the pairwise comparisons in terms of predictive accuracy amongst Ant-Miner and cAnt-Miner variations, conducted in order to evaluate the influence of each of the individual Ant-Miner extensions proposed in chapter 5. For each row, the ‘⊕’ (‘⊖’) symbol indicates that the first algorithm performs better (worse) than the second algorithm, followed by the sum of positive/negative ranks (Score column) and the corresponding p-value, according to the Wilcoxon signed rank test. The significant differences at the 0.01 level are shown in bold.  117