Abhisek Acharya

Comparative Study of Machine Learning Algorithms for Heart Disease Prediction

Helsinki Metropolia University of Applied Sciences
Bachelor of Engineering
Information Technology
Thesis
28 April 2017

Abstract

Author(s): Abhisek Acharya
Title: Comparative study of machine learning algorithms for heart disease prediction
Number of Pages: 43 pages + 4 appendices
Date: 4 Nov 2016
Degree: Bachelor of Engineering
Degree Programme: Information Technology
Specialisation option: Software Engineering
Instructor(s): Sakari Lukkarinen, Senior Lecturer

As technology and hardware capability advance, machine learning is advancing as well, and its use is growing in every field from stock analysis to medical image processing. Heart disease prediction is one of the fields where machine learning can be applied. This study therefore investigates different machine learning algorithms and compares their results using several performance metrics, i.e. accuracy, precision, recall, F1-score etc. The dataset used for this study was taken from the UCI Machine Learning Repository and is titled "Heart Disease Data Set".

This study was executed as a quantitative case study, and several previous studies on this dataset were analysed in depth to understand the subject more thoroughly. Statistics and numbers are used extensively so that the study is quantifiable and correlations between data attributes can be found.

The main objective of this study was to compare algorithms that can correctly classify heart disease, based on different performance metrics. The dataset contains 13 independent variables (features) and 1 dependent variable to be predicted. In the original dataset the predicted variable ranges from 0 to 4, representing a healthy heart at 0 up to a severely unhealthy heart at 4. For this study, the 0 to 4 class labels were reduced to 0 and 1, so the predicted class is either 0 ("Healthy") or 1 ("Unhealthy"). Techniques such as feature selection, grid search and probability calibration were used to obtain optimal results.

In this study, the algorithms k-Nearest Neighbour, Support Vector Machine, Naïve Bayes, AdaBoost, Random Forest and Artificial Neural Network are used. It can be concluded that Artificial Neural Network and Support Vector Machine are the best algorithms for this dataset and possibly for other heart disease datasets. Before the conclusions of this study can be applied clinically, they need to be further elaborated with the help of experts in both the heart disease and machine learning domains.
Keywords: heart, classification, SVC, ANN, Naïve Bayes, training set

List of Abbreviations

AdaBoost  Adaptive Boosting
ANN       Artificial Neural Network
CVD       Cardiovascular Disease
FP        False Positive
FN        False Negative
KNN       k-Nearest Neighbour
ML        Machine Learning
NB        Naïve Bayes
RF        Random Forest
SBS       Sequential Backward Selection
SFS       Sequential Forward Selection
SVC       Support Vector Classifier
SVM       Support Vector Machine
TN        True Negative
TP        True Positive
UCI       University of California, Irvine

Contents

1 Introduction
2 Theoretical Background
  2.1 Overview of Machine Learning
  2.2 Dimension Reduction Techniques
    2.2.1 Feature Selection
    2.2.2 Feature Extraction
  2.3 Machine Learning Algorithms
    2.3.1 k-Nearest Neighbour
    2.3.2 Random Forest
    2.3.3 Support Vector Machine (SVM)
    2.3.4 Naïve Bayes
    2.3.5 AdaBoost
    2.3.6 Artificial Neural Network
  2.4 Performance Metrics
  2.6 Probability Calibration
  2.7 Data Pre-processing
3 Methods and Materials
  3.1 Research Approach
  3.2 Research Design
  3.3 Data Collection and Description
  3.4 Tools Used
  3.5 Data Cleaning and Scaling
  3.6 Data Reduction
  3.7 Descriptive Statistics and Exploratory Data Analysis
4 Results and Discussion
  4.1 Overview
  4.2 Experiments
    4.2.1 Training Set
    4.2.2 Test Set
    4.2.3 Results
  4.3 Comparison of Results
5 Conclusion
References
Appendices
  Appendix 1. Sample of Data
  Appendix 2. Sample of OneHotEncoded Data
  Appendix 3. Sample code of Naïve Bayes with Feature Selection

1 Introduction

Cardiovascular disease (CVD) is increasing day by day in this modern world. According to the WHO, an estimated 17 million people die of CVDs, particularly heart attacks and strokes, every year [1]. Thus, it is necessary to identify the most important symptoms and health habits that contribute to CVDs.

Different tests are conducted before diagnosing CVD; some of them are auscultation, ECG, blood pressure, cholesterol and blood sugar tests. These tests are often extensive and time consuming, while a patient's condition may be critical and immediate medication may be needed, so it becomes important to prioritise the tests. Several health habits also contribute to CVDs. Therefore, it is necessary to know which health habits contribute to CVDs so that healthy habits can be practised.

Machine learning is an emerging field today due to the rise in the amount of data. Machine learning helps to gain insight from massive amounts of data, which is very cumbersome and sometimes even impossible for humans. The study's objective is to prioritise the diagnostic tests and to examine some of the health habits that contribute to CVD. Furthermore, and most importantly, different machine learning algorithms are compared according to different performance metrics. In this thesis, manually classified data is used; the manual classification is either healthy or unhealthy. Using the machine learning technique called classification, 70 percent of the data is used for training and 30 percent for testing. The algorithms are then compared by their prediction results.

2 Theoretical Background

2.1 Overview of Machine Learning

Machine learning is a subfield of computer science and a rapidly growing topic in today's context, and it is expected to grow even more in the coming days. Our world is flooded with data, and data is being created rapidly every day all around the world.
According to the big data and analytics solutions company CSC, it is expected that by 2020 the amount of data will be 44 times bigger than in 2009 [2]. Therefore, it is necessary to understand data and gain insights from it for a better understanding of the human world. The amount of data today is so huge that traditional methods cannot be used. Analysing data or building predictive models manually is time consuming, less productive and in some scenarios almost impossible. Machine learning, on the other hand, produces reliable, repeatable results and learns from earlier computations.

Data used for machine learning is basically of two types: labelled data and unlabelled data. Labelled data is data where the attributes are provided; it has some sort of tag or meaning attached to it and is therefore used in supervised learning. A labelled attribute can be numerical or categorical. Numerical labels are used in regression to predict a value, while categorical labels are used in classification. Unlabelled data is data where there are only data points and no labels to assist, and it is used in unsupervised learning so that the machine can identify the patterns or any structure present in the dataset. [3, 3]

Labelled data and unlabelled data are used with supervised learning and unsupervised learning respectively. Supervised learning entails learning a mapping between a set of input variables X and an output variable Y, and applying this mapping to predict the output for unseen data [3]. After learning from the dataset, the algorithm generalises the data and formulates a hypothesis H for the given dataset.

Supervised learning is further categorised into two types: regression and classification. According to the business dictionary, regression is a technique for determining the statistical relationship between two or more variables where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables [4]. Classification is a task that occurs very frequently in everyday life. Essentially it involves dividing up objects so that each is assigned to one of a number of mutually exhaustive and exclusive categories known as classes. The term 'mutually exhaustive and exclusive' simply means that each object must be assigned to precisely one class, i.e. never to more than one and never to no class at all [5, 5].

Unsupervised learning studies how systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. By contrast with supervised learning, there are no explicit target outputs or environmental evaluations associated with each input; rather, the unsupervised learner brings prior biases as to what aspects of the structure of the input should be captured in the output. [6, 858]

2.2 Dimension Reduction Techniques

Dimensionality reduction is the process of decreasing the number of input random variables without loss of information. A greater number of input variables or dimensions and large data samples increase the complexity of the dataset. The dimensionality of the data is reduced to save memory and computational time. Dimensionality reduction also helps to eliminate unnecessary input variables, such as duplicate variables or variables with a very low significance level. [7, 109-110]

There are two types of dimensionality reduction techniques, feature selection and feature extraction, which are described below.
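As a small, concrete illustration of eliminating low-information input variables before either technique is applied, the following is a minimal sketch using scikit-learn's VarianceThreshold; the toy data and the threshold value are assumptions chosen for illustration only, not settings from this study:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    # Toy feature matrix: the third column is almost constant and
    # therefore carries very little information for prediction.
    X = np.array([[0.2, 1.4, 1.0],
                  [0.8, 0.9, 1.0],
                  [0.5, 1.1, 1.0],
                  [0.9, 1.6, 1.1]])

    # Drop features whose variance falls below the (illustrative) threshold.
    selector = VarianceThreshold(threshold=0.01)
    X_reduced = selector.fit_transform(X)

    print(selector.variances_)   # per-feature variances
    print(X_reduced.shape)       # (4, 2): one dimension was removed

Here the near-constant third column is dropped, shrinking the input space from three dimensions to two before any model is trained.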
2.2.1 Feature Selection

In feature selection, the k dimensions out of d dimensions that give the most information are selected, and the remaining (d - k) dimensions are discarded. Feature selection is therefore also called subset selection. The best subset contains the least number of dimensions that contribute most to the accuracy, and it is found with a suitable error function. [7, 110.]

Figure 1. Feature selection. Adapted from Md Rahat Hossian (2013) [8]

In Figure 1, the training data is put through a process of subset generation, for example sequential backward selection. The resulting subset is then put through the algorithm to test its performance. If the performance meets the expected criteria, the subset is selected as the final subset; otherwise, the resulting subset is put through the process of subset generation again for further fine-tuning.

There are two different approaches to feature selection, sequential forward selection and sequential backward selection, which are explained below.

Sequential Forward Selection

Sequential forward selection begins with a model containing no predictors, and predictors are then added to the model one at a time until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model [9, 207].

Let us denote a set of variables by P, with variables X_i, i = 1, ..., d, and let E(P) be the error incurred on the test sample. Sequential forward selection starts with the empty set P = {}, containing no variables. At each step, a single candidate variable is added to the set, a model is trained, and the error E(P ∪ {X_i}) is calculated on the test set. The error criterion is set as required, for example the least squares error or the misclassification error. Of all the candidates, the input variable causing the least error, X_j, is selected and added to the set P. The model is then trained again with the remaining variables, and the process continues to add variables to P as long as E(P ∪ {X_j}) is less than E(P).

Sequential Backward Selection

Sequential backward selection is an efficient alternative for finding the best subset, but unlike sequential forward selection it begins with the full set of features. It iteratively removes the least significant feature one at a time. [9, 207.]

Sequential backward selection starts with the full set of variables P = {1, 2, 3, ..., d}. At each step, the model is trained with the current set of variables and the error is calculated on the test set; the variable whose removal worsens the error the least, X_j, is removed from the set P. The model is trained again with the new set of variables P, and the process continues to remove variables from P as long as E(P - {X_j}) is less than E(P).

2.2.2 Feature Extraction

In the feature extraction technique, the features or independent variables of the dataset are transformed into new independent variables known as a new feature space. The newly constructed feature space explains the data the most, and only significant features are retained. Let there be n features X_1, ..., X_n. After feature extraction there are m attributes, where n > m, and the extraction is done with some mapping function F.
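To make the greedy procedure above concrete, the following is a minimal sketch of sequential forward selection. Everything in it is illustrative: the k-nearest neighbour classifier, the breast cancer stand-in dataset, the 70/30 split and the misclassification-error criterion are assumptions for the example, not the experimental settings of this study:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def sequential_forward_selection(X_tr, y_tr, X_te, y_te, estimator):
        """Start from the empty set P; at each step add the variable X_j
        with the lowest test error E(P ∪ {X_j}); stop as soon as no
        addition lowers E(P)."""
        remaining = list(range(X_tr.shape[1]))
        P = []                    # indices of the selected variables
        best_error = np.inf       # E(P) for the empty set
        while remaining:
            errors = {}
            for j in remaining:
                cols = P + [j]
                estimator.fit(X_tr[:, cols], y_tr)
                # misclassification error on the held-out test set
                errors[j] = 1.0 - estimator.score(X_te[:, cols], y_te)
            j_best = min(errors, key=errors.get)
            if errors[j_best] >= best_error:   # E(P ∪ {X_j}) not below E(P)
                break
            P.append(j_best)
            remaining.remove(j_best)
            best_error = errors[j_best]
        return P, best_error

    X, y = load_breast_cancer(return_X_y=True)   # stand-in labelled data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    selected, error = sequential_forward_selection(
        X_tr, y_tr, X_te, y_te, KNeighborsClassifier(n_neighbors=5))
    print("selected feature indices:", selected, "test error:", error)

Sequential backward selection follows the same pattern in reverse: P starts with all variable indices, and at each step the variable whose removal yields the lowest test error is dropped. In practice, scikit-learn (version 0.24 or later) provides both directions ready-made as sklearn.feature_selection.SequentialFeatureSelector with direction='forward' or direction='backward'.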