ggRandomForests: Random Forests for Regression

John Ehrlinger
Cleveland Clinic

Abstract

Random Forests (Breiman 2001) (RF) are a non-parametric statistical method requiring no distributional assumptions on covariate relation to the response. RF are a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. The randomForestSRC package (Ishwaran and Kogalur 2014) is a unified treatment of Breiman's random forests for survival, regression and classification problems.

Predictive accuracy makes RF an attractive alternative to parametric models, though the complexity and interpretability of the forest hinder wider application of the method. We introduce the ggRandomForests package, tools for visually understanding random forest models grown in R (R Core Team 2014) with the randomForestSRC package. The ggRandomForests package is structured to extract intermediate data objects from randomForestSRC objects and generate figures using the ggplot2 (Wickham 2009) graphics package.

This document is structured as a tutorial for building random forests for regression with the randomForestSRC package and using the ggRandomForests package for investigating how the forest is constructed. We investigate the Boston Housing data (Harrison and Rubinfeld 1978; Belsley, Kuh, and Welsch 1980). We demonstrate random forest variable selection using Variable Importance (VIMP) (Breiman 2001) and Minimal Depth (Ishwaran, Kogalur, Gorodeski, Minn, and Lauer 2010), a property derived from the construction of each tree within the forest. We also demonstrate the use of variable dependence plots (Friedman 2000) to aid interpretation of RF results. We then examine variable interactions between covariates using minimal depth interactions and conditional variable dependence plots.
The goal of the exercise is to demonstrate the strength of using Random Forest methods for both prediction and information retrieval in regression settings.

Keywords: random forest, regression, VIMP, minimal depth, R, randomForestSRC.

About this document

This document is a package vignette for the ggRandomForests package for "Visually Exploring Random Forests" (http://CRAN.R-project.org/package=ggRandomForests). The ggRandomForests package is designed for use with the randomForestSRC package (Ishwaran and Kogalur 2014, http://CRAN.R-project.org/package=randomForestSRC) for growing random forests for survival (time to event response), regression (continuous response) and classification (categorical response) settings, and uses the ggplot2 package (Wickham 2009, http://CRAN.R-project.org/package=ggplot2) for plotting diagnostic and variable association results. ggRandomForests is structured to extract data objects from randomForestSRC objects and provides functions for printing and plotting these objects.

The vignette is a tutorial for using the ggRandomForests package with the randomForestSRC package for building and post-processing random forests for regression settings. In this tutorial, we explore a random forest for regression model constructed for the Boston housing data set (Harrison and Rubinfeld 1978; Belsley et al. 1980), available in the MASS package (Venables and Ripley 2002). We grow a random forest and demonstrate how ggRandomForests can be used to determine how the response depends on predictive variables within the model. The tutorial demonstrates the design and usage of many ggRandomForests functions and features, and also how to modify and customize the resulting ggplot graphic objects along the way.

The vignette is written in LaTeX using the knitr package (Xie 2015, 2014, 2013, http://CRAN.R-project.org/package=knitr), which facilitates weaving R (R Core Team 2014) code, results and figures into document text. Throughout this document, R code will be displayed in code blocks as shown below. This code block loads the R packages required to run the source code listed in code blocks throughout the remainder of this document.

R> ################## Load packages ##################
R> library("ggplot2")         # Graphics engine
R> library("RColorBrewer")    # Nice color palettes
R> library("plot3D")          # for 3d surfaces.
R> library("dplyr")           # Better data manipulations
R> library("parallel")        # mclapply for multicore processing
R>
R> # Analysis packages.
R> library("randomForestSRC") # random forests for survival, regression and
R>                            # classification
R> library("ggRandomForests") # ggplot2 random forest figures (This!)
R>
R> ################ Default Settings ##################
R> theme_set(theme_bw())      # A ggplot2 theme with white background
R>
R> ## Set open circle for censored, and x for events
R> event.marks <- c(1, 4)
R> event.labels <- c(FALSE, TRUE)
R>
R> ## We want red for death events, so reorder this set.
R> strCol <- brewer.pal(3, "Set1")[c(2, 1, 3)]

This vignette is available within the ggRandomForests package on the Comprehensive R Archive Network (CRAN) (R Core Team 2014, http://cran.r-project.org). Once the package has been installed, the vignette can be viewed directly from within R with the following command:

R> vignette("randomForestSRC-Regression", package = "ggRandomForests")

A development version of the ggRandomForests package is also available on GitHub (https://github.com). We invite comments, feature requests and bug reports for this package at https://github.com/ehrlinger/ggRandomForests.

1. Introduction

Random Forests (Breiman 2001) (RF) are a fully non-parametric statistical method which requires no distributional or functional assumptions on covariate relation to the response.
RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random Survival Forests (RSF) (Ishwaran and Kogalur 2007; Ishwaran, Kogalur, Blackstone, and Lauer 2008) are an extension of Breiman's RF techniques to survival settings, allowing efficient non-parametric analysis of time to event data. The randomForestSRC package (Ishwaran and Kogalur 2014) is a unified treatment of Breiman's random forests for survival (time to event response), regression (continuous response) and classification (categorical response) problems.

Predictive accuracy makes RF an attractive alternative to parametric models, though the complexity and interpretability of the forest hinder wider application of the method. We introduce the ggRandomForests package for visually exploring random forest models. The ggRandomForests package is structured to extract intermediate data objects from randomForestSRC objects and generate figures using the ggplot2 graphics package (Wickham 2009).

Many of the figures created by the ggRandomForests package are also available directly from within the randomForestSRC package. However, ggRandomForests offers the following advantages:

• Separation of data and figures: ggRandomForests contains functions that operate on either the randomForestSRC::rfsrc forest object directly, or on the output from randomForestSRC post-processing functions (i.e., plot.variable, var.select, find.interaction) to generate intermediate ggRandomForests data objects. Functions are provided to further process these objects and plot results using the ggplot2 graphics package. Alternatively, users can use these data objects for their own custom plotting or analysis operations.

• Each data object/figure is a single, self-contained object. This allows simple modification and manipulation of the data or ggplot2 objects to meet users' specific needs and requirements.

• The use of ggplot2 for plotting.
We chose to use the ggplot2 package for our figures to allow users flexibility in modifying the figures to their liking. Each plot function returns either a single ggplot object, or a list of ggplot objects, allowing users to apply additional ggplot2 functions or themes to modify and customize the figures.

This document is formatted as a tutorial for using the randomForestSRC package for building and post-processing random forest models, with the ggRandomForests package for investigating how the forest is constructed. In this tutorial, we use the Boston Housing Data (Section 2), available in the MASS package (Venables and Ripley 2002), to build a random forest for regression (Section 3) and demonstrate the tools in the ggRandomForests package for examining the forest construction.

Random forests are not parsimonious, but use all variables available in the construction of a response predictor. We demonstrate a random forest variable selection (Section 4) process using the Variable Importance (Section 4.1) measure (VIMP) (Breiman 2001) as well as Minimal Depth (Section 4.2) (Ishwaran et al. 2010), a property derived from the construction of each tree within the forest, to assess the impact of variables on forest prediction.

Once we have an idea of which variables we want to investigate further, we will use variable dependence plots (Friedman 2000) to understand how a variable is related to the response (Section 5). Marginal dependence plots (Section 5.1) give us an idea of the overall trend of a variable/response relation, while partial dependence plots (Section 5.2) show us a risk adjusted relation. These figures may show strongly non-linear variable/response relations that are not easily obtained through a parametric approach. We are also interested in examining variable interactions within the forest model (Section 6).
Using a minimal depth approach, we can quantify how closely variables are related within the forest, and generate marginal dependence (Section 7) and partial dependence (Section 7.1) (risk adjusted) conditioning plots (coplots) (Chambers 1992; Cleveland 1993) to examine these interactions graphically.

2. Data: Boston Housing Values

The Boston Housing data is a standard benchmark data set for regression models. It contains data for 506 census tracts of Boston from the 1970 census (Harrison and Rubinfeld 1978; Belsley et al. 1980). The data is available in multiple R packages, but to keep the installation dependencies for the ggRandomForests package down, we will use the data contained in the MASS package (Venables and Ripley 2002, http://CRAN.R-project.org/package=MASS), available with the base install of R. The following code block loads the data into the environment. We include a table of the Boston data set variable names, types and descriptions for reference when we interpret the model results.

R> # Load the Boston Housing data
R> data(Boston, package = "MASS")
R>
R> # Set modes correctly. For binary variables: transform to logical
R> Boston$chas <- as.logical(Boston$chas)

The main objective of the Boston Housing data is to investigate variables associated with predicting the median value of homes (continuous medv response) within 506 suburban areas of Boston.

2.1. Exploratory Data Analysis

It is good practice to view your data before beginning an analysis, what Tukey (1977) refers to as Exploratory Data Analysis (EDA). To facilitate this, we use ggplot2 figures with the ggplot2::facet_wrap command to create two sets of panel plots, one for categorical variables with boxplots at each level, and one of scatter plots for continuous variables. Each variable is plotted along a selected continuous variable on the X-axis. These figures help to find outliers, missing values and other data anomalies in each variable before getting deep into the analysis.
We have also created a separate shiny app (Chang, Cheng, Allaire, Xie, and McPherson 2015, http://shiny.rstudio.com), available at https://ehrlinger.shinyapps.io/xportEDA, for creating similar figures with an arbitrary data set, to make the EDA process easier for users.

Variable  Description                                                       Type
crim      Crime rate by town.                                               numeric
zn        Proportion of residential land zoned for lots over 25,000 sq.ft.  numeric
indus     Proportion of non-retail business acres per town.                 numeric
chas      Charles River (tract bounds river).                               logical
nox       Nitrogen oxides concentration (10 ppm).                           numeric
rm        Number of rooms per dwelling.                                     numeric
age       Proportion of units built prior to 1940.                          numeric
dis       Distances to Boston employment center.                            numeric
rad       Accessibility to highways.                                        integer
tax       Property-tax rate per $10,000.                                    numeric
ptratio   Pupil-teacher ratio by town.                                      numeric
black     Proportion of blacks by town.                                     numeric
lstat     Lower status of the population (percent).                         numeric
medv      Median value of homes ($1000s).                                   numeric

Table 1: Boston housing data dictionary.

The Boston housing data consists almost entirely of continuous variables, with the exception of the "Charles river" logical variable. A simple EDA visualization to use for this data is a single panel plot of the continuous variables, with observation points colored by the logical variable. Missing values in our continuous variable plots are indicated by rug marks along the x-axis, of which there are none in this data. We used the Boston housing response variable, the median value of homes (medv), as the X variable.

R> # Use reshape2::melt to transform the data into long format.
R> dta <- reshape2::melt(Boston, id.vars = c("medv", "chas"))
R>
R> # plot panels for each covariate colored by the logical chas variable.
R> ggplot(dta, aes(x=medv, y=value, color=chas)) +
+   geom_point(alpha=.4) +
+   geom_rug(data=dta %>% filter(is.na(value))) +
+   labs(y="", x=st.labs["medv"]) +
+   scale_color_brewer(palette="Set2") +
+   facet_wrap(~variable, scales="free_y", ncol=3)

Figure 1: EDA variable plots. Points indicate variable value against the median home value variable. Points are colored according to the chas variable.

This figure is loosely related to a pairs scatter plot (Becker, Chambers, and Wilks 1988), but in this case we only examine the relation between the response variable and the remainder. Plotting the data against the response also gives us a "sanity check" when viewing our model results. It's pretty obvious from this figure that we should find a strong relation between median home values and the lstat and rm variables.

3. Random Forest - Regression

A Random Forest is grown by bagging (Breiman 1996a) a collection of classification and regression trees (CART) (Breiman, Friedman, Olshen, and Stone 1984). The method uses a set of B bootstrap (Efron and Tibshirani 1994) samples, growing an independent tree model on each sub-sample of the population. Each tree is grown by recursively partitioning the population based on optimization of a split rule over the p-dimensional covariate space. At each split, a subset of m ≤ p candidate variables is tested for the split rule optimization, dividing each node into two daughter nodes. Each daughter node is then split again until the process reaches the stopping criteria of either node purity or node member size, which defines the set of terminal (unsplit) nodes for the tree.
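The bagging procedure described above can be sketched by hand. The following block is our simplified illustration, not part of the vignette's analysis: it bags a small set of CART trees grown with the rpart package (shipped with R) on the Boston data. Note that, unlike rfsrc, this sketch does not randomly subset mtry candidate variables at each split.

```r
# Sketch: bootstrap-aggregated (bagged) CART regression trees by hand.
library(rpart)
data(Boston, package = "MASS")

set.seed(1)
B <- 25  # a small forest for illustration; rfsrc defaults to 1000 trees
preds <- sapply(seq_len(B), function(b) {
  in.bag <- sample(nrow(Boston), replace = TRUE)    # bootstrap sample
  tree   <- rpart(medv ~ ., data = Boston[in.bag, ]) # one CART tree
  predict(tree, newdata = Boston)                    # predictions from this tree
})

# Bagged regression estimate: average the predictions over the B trees.
bagged <- rowMeans(preds)
```

Averaging over trees is what stabilizes the estimates: each individual CART tree is high-variance, but the bootstrap ensemble smooths the step-function predictions.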
In regression trees, the split rule is based on minimizing the mean squared error, whereas in classification problems, the Gini index is used (Friedman 2000). Random Forests sort each training set observation into one unique terminal node per tree. Tree estimates for each observation are constructed at each terminal node, among the terminal node members. The Random Forest estimate for each observation is then calculated by aggregating, averaging (regression) or voting (classification), the terminal node results across the collection of B trees.

For this tutorial, we grow the random forest for regression using the rfsrc command to predict the median home value (medv variable) using the remaining 13 independent predictor variables. For this example we will use the default settings of B = 1000 trees (ntree argument) and m = 5 candidate variables (mtry) for each split, with a stopping criterion of nodesize = 5, the minimum number of observations within each terminal node.

Because growing random forests is computationally expensive, and the ggRandomForests package is targeted at the visualization of random forest objects, we will use cached copies of the randomForestSRC objects throughout this document. We include the cached objects as data sets in the ggRandomForests package. The actual rfsrc calls are included in comments within code blocks.

R> # Load the data, from the call:
R> # rfsrc_Boston <- rfsrc(medv~., data=Boston)
R> data(rfsrc_Boston)
R>
R> # print the forest summary
R> rfsrc_Boston

                             Sample size: 506
                         Number of trees: 1000
              Minimum terminal node size: 5
           Average no. of terminal nodes: 79.911
    No. of variables tried at each split: 5
                  Total no. of variables: 13
                                Analysis: RF-R
                                  Family: regr
                          Splitting rule: regr
                    % variance explained: 85.88
                              Error rate: 11.94

The randomForestSRC::print.rfsrc summary details the parameters used for the rfsrc call described above, and returns the variance and generalization error estimate from the forest training set.
The forest is built from 506 observations and 13 independent variables. It was constructed for the continuous medv variable using ntree = 1000 regression (regr) trees, randomly selecting 5 candidate variables at each node split, and terminating nodes with no fewer than 5 observations.

3.1. Generalization error estimates

One advantage of Random Forests is a built-in generalization error estimate. Each bootstrap sample selects approximately 63.2% of the population on average. The remaining 36.8% of observations, the Out-of-Bag (OOB) (Breiman 1996b) sample, can be used as a hold-out test set for each of the trees in the forest. An OOB prediction error estimate can be calculated for each observation by predicting the response over the set of trees which were NOT trained with that particular observation. The Out-of-Bag prediction error estimates have been shown to be nearly identical to n-fold cross validation estimates (Hastie, Tibshirani, and Friedman 2009). This feature of Random Forests allows us to obtain both model fit and validation in one pass of the algorithm.

The gg_error function operates on the randomForestSRC::rfsrc object to extract the error estimates as the forest is grown. The code block demonstrates part of the ggRandomForests design philosophy, to create separate data objects and provide functions to operate on the data objects. The following code block first creates a gg_error object, then uses the plot.gg_error function to create a ggplot object for display.

R> # Plot the OOB errors against the growth of the forest.
R> gg_e <- gg_error(rfsrc_Boston)
R> plot(gg_e)

Figure 2: Random forest generalization error. OOB error convergence along the number of trees in the forest.

This figure demonstrates that it does not take a large number of trees to stabilize the forest prediction error estimate.
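The 63.2% in-bag figure quoted above can be checked numerically. This short sketch is ours, not part of the vignette: it simulates repeated bootstrap draws of size n = 506 and measures the fraction of unique observations included.

```r
# Expected fraction of unique observations in a bootstrap sample of size n
# is 1 - (1 - 1/n)^n, which approaches 1 - exp(-1), about 0.632, for large n.
set.seed(1)
n <- 506
in.bag <- replicate(1000, length(unique(sample(n, n, replace = TRUE))) / n)

mean(in.bag)      # in-bag fraction, close to 0.632
1 - mean(in.bag)  # OOB fraction, close to 0.368
```

The complementary ~36.8% of observations left out of each draw is exactly the OOB sample that supplies the hold-out error estimate for each tree.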
However, to ensure that each variable has enough of a chance to be included in the forest prediction process, we do want to create a rather large random forest of trees.

3.2. Random Forest Prediction

The gg_rfsrc function extracts the OOB prediction estimates from the random forest. This code block executes the data extraction and plotting in one line, since we are not interested in holding the prediction estimates for later reuse. Also note that we add an additional ggplot2 command (coord_cartesian) to modify the plot object. Each of the ggRandomForests plot commands returns a ggplot object, which we can also store for modification or reuse later in the analysis.

R> # Plot predicted median home values.
R> plot(gg_rfsrc(rfsrc_Boston), alpha=.5) +
+   coord_cartesian(ylim=c(5,49))

Figure 3: OOB predicted median home values. Points are jittered to help visualize predictions for each observation. Boxplot indicates the distribution of the predicted values.

The gg_rfsrc plot shows the predicted median home value, one point for each observation in the training set. The points are jittered around a single point on the x-axis, since we are only looking at predicted values from the forest. These estimates are Out of Bag, which are analogous to test set estimates. The boxplot is shown to give an indication of the distribution of the prediction estimates. For this analysis the figure is another model sanity check, as we are more interested in exploring the "why" questions for these predictions.

4. Variable Selection

Random forests are not parsimonious, but use all variables available in the construction of a response predictor. Also, unlike parametric models, Random Forests do not require the explicit specification of the functional form of covariates to the response. Therefore there is no explicit p-value/significance test for variable selection with a random forest model.
Instead, RF ascertains which variables contribute to the prediction through the split rule optimization, optimally choosing variables which separate observations. We use two separate approaches to explore the RF selection process, Variable Importance (Section 4.1) and Minimal Depth (Section 4.2).

4.1. Variable Importance

Variable importance (VIMP) was originally defined in CART using a measure involving surrogate variables (see Chapter 5 of Breiman et al. (1984)). The most popular VIMP method uses a prediction error approach involving "noising-up" each variable in turn. VIMP for a variable x_v is the difference between prediction error when x_v is noised up by randomly permuting its values, compared to prediction error under the observed values (Breiman 2001; Liaw and Wiener 2002; Ishwaran 2007; Ishwaran et al. 2008).

Since VIMP is the difference between OOB prediction error before and after permutation, a large VIMP value indicates that misspecification detracts from the variable's predictive accuracy in the forest. VIMP close to zero indicates the variable contributes nothing to predictive accuracy, and negative values indicate the predictive accuracy improves when the variable is misspecified. In the latter case, we assume noise is more informative than the true variable. As such, we ignore variables with negative and near zero values of VIMP, relying on large positive values to indicate that the predictive power of the forest is dependent on those variables.

The gg_vimp function extracts VIMP measures for each of the variables used to grow the forest. The plot.gg_vimp function shows the variables, in VIMP rank order, from the largest (Lower Status) at the top, to the smallest (Charles River) at the bottom. VIMP measures are shown using bars to compare the scale of the error increase under permutation.

R> # Plot the VIMP rankings of independent variables.
R> plot(gg_vimp(rfsrc_Boston), lbls=st.labs)
Figure 4: Random forest VIMP plot. Bars are colored by sign of VIMP; longer blue bars indicate more important variables.

For our random forest, the top two variables (lstat and rm) have the largest VIMP, with a sizable difference to the remaining variables, which mostly have similar VIMP measures. This indicates we should focus attention on these two variables, at least, over the others.

In this example, all VIMP measures are positive, though some are small. When there are both negative and positive VIMP values, the plot.gg_vimp function will color VIMP by the sign of the measure. We use the lbls argument to pass a named vector of meaningful text labels for the variables.
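The permutation idea behind VIMP can be illustrated outside the forest machinery. This sketch is our simplification, not the vignette's code: it uses a single rpart tree in place of the forest, and training set error in place of OOB error, to show how "noising up" lstat inflates prediction error.

```r
# Permutation importance by hand: error increase after permuting one variable.
library(rpart)
data(Boston, package = "MASS")

set.seed(1)
tree <- rpart(medv ~ ., data = Boston)

# Mean squared prediction error of a fitted model on a data set.
mse <- function(fit, dat) mean((dat$medv - predict(fit, newdata = dat))^2)

base.err <- mse(tree, Boston)

noised <- Boston
noised$lstat <- sample(noised$lstat)        # randomly permute ("noise up") lstat

# A large positive difference indicates predictions depend on lstat.
vimp.lstat <- mse(tree, noised) - base.err
```

In the real forest, rfsrc computes this difference using OOB observations per tree and aggregates across all trees, which is what gg_vimp extracts and plots above.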
