
Data Mining Using SAS® Enterprise Miner™: A Case Study Approach, Second Edition

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2003. Data Mining Using SAS® Enterprise Miner™: A Case Study Approach, Second Edition. Cary, NC: SAS Institute Inc.

Data Mining Using SAS® Enterprise Miner™: A Case Study Approach, Second Edition
Copyright © 2003, SAS Institute Inc., Cary, NC, USA
ISBN 1-59047-395-7
All rights reserved. Produced in the United States of America.

Your use of this e-book shall be governed by the terms established by the vendor at the time you acquire this e-book.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, April 2003

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Contents

Chapter 1 Introduction to SAS Enterprise Miner
  Starting Enterprise Miner
  Setting Up the Initial Project and Diagram
  Identifying the Interface Components
  Data Mining and SEMMA
  Accessing SAS Data through SAS Libraries
Chapter 2 Predictive Modeling
  Problem Formulation
  Creating a Process Flow Diagram
  Data Preparation and Investigation
  Fitting and Comparing Candidate Models
  Generating and Using Scoring Code
  Generating a Report Using the Reporter Node
Chapter 3 Variable Selection
  Introduction to Variable Selection
  Using the Variable Selection Node
Chapter 4 Clustering Tools
  Problem Formulation
  Overview of Clustering Methods
Chapter 5 Association Analysis
  Problem Formulation
Chapter 6 Link Analysis
  Problem Formulation
  Examining Web Log Data
Appendix 1 Recommended Reading
  Recommended Reading
Index

Chapter 1 Introduction to SAS Enterprise Miner

  Starting Enterprise Miner
  Setting Up the Initial Project and Diagram
  Identifying the Interface Components
  Data Mining and SEMMA
    Definition of Data Mining
    Overview of the Data
    Predictive and Descriptive Techniques
    Overview of SEMMA
    Overview of the Nodes
    Sample Nodes
    Explore Nodes
    Modify Nodes
    Model Nodes
    Assess Nodes
    Scoring Nodes
    Utility Nodes
    Some General Usage Rules for Nodes
  Accessing SAS Data through SAS Libraries

Starting Enterprise Miner

To start Enterprise Miner, start SAS and then type miner on the SAS command bar. Submit the command by pressing the Return key or by clicking the check mark icon next to the command bar. Alternatively, select from the main menu

Solutions > Analysis > Enterprise Miner

For more information, see Getting Started with SAS Enterprise Miner.

Setting Up the Initial Project and Diagram

Enterprise Miner organizes data analyses into projects and diagrams. Each project may have several process flow diagrams, and each diagram may contain several analyses.
Typically each diagram contains an analysis of one data set.

Follow these steps to create a project.

1. From the SAS menu bar, select File > New > Project
2. Type a name for the project, such as My Project.
3. Select the Client/server project check box if necessary.
   Note: You must have access to a server that runs the same version of Enterprise Miner. For information about building a client/server project, see Getting Started with SAS Enterprise Miner or the online Help.
4. Modify the location of the project folder by either typing a different location in the Location field or by clicking Browse.
5. Click Create. The project opens with an initial untitled diagram.
6. Select the diagram title and type a new name, such as My First Flow.

Identifying the Interface Components

The SAS Enterprise Miner window contains the following interface components:

• Project Navigator: enables you to manage projects and diagrams, add tools to the Diagram Workspace, and view HTML reports that are created by the Reporter node. Note that when a tool is added to the Diagram Workspace, the tool is referred to as a node. The Project Navigator has three tabs:
  • Diagrams tab: lists the current project and the diagrams within the project. By default, the project window opens with the Diagrams tab activated.
  • Tools tab: contains the Enterprise Miner tools palette. This tab enables you to see all of the tools (or nodes) that are available in Enterprise Miner. The tools are grouped according to the SEMMA data-mining methodology. Many of the commonly used tools are shown on the Tools Bar at the top of the window. You can add tools to the Tools Bar by dragging them from the Tools tab onto the Tools Bar. In addition, you can rearrange the tools on the Tools Bar by dragging each tool to a new location on the Tools Bar.
  • Reports tab: displays the HTML reports that are generated by using the Reporter node.
• Diagram Workspace: enables you to build, edit, run, and save process flow diagrams.
• Tools Bar: contains a customizable subset of Enterprise Miner tools that are commonly used to build process flow diagrams in the Diagram Workspace. You can add or delete tools from the Tools Bar.
• Progress Indicator: displays a progress indicator bar that indicates the execution status of an Enterprise Miner task.
• Message Panel: displays messages about the execution of an Enterprise Miner task.
• Connection Status Indicator: displays the remote host name and indicates whether the connection is active for a client/server project.

Data Mining and SEMMA

Definition of Data Mining

This document defines data mining as advanced methods for exploring and modeling relationships in large amounts of data.

Overview of the Data

Your data often comes from several different sources, and combining information from these different sources may present quite a challenge. The need for better and quicker access to information has generated a great deal of interest in building data warehouses that are able to quickly assemble and deliver the needed information in usable form.

To download documentation that discusses the Enterprise Miner add-ins to SAS/Warehouse Administrator, go to the SAS Customer Support Center Web site (http://support.sas.com). From Software Downloads, select Product and Solution Updates. From the Demos and Downloads page, select SAS/Warehouse Administrator Software, and download the version that you want.

A typical data set has many thousands of observations. An observation may represent an entity such as an individual customer, a specific transaction, or a certain household. Variables in the data set contain specific information such as demographic information, sales history, or financial information for each observation. How this information is used depends on the research question of interest.
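
To make the terms observation and variable concrete, here is a minimal sketch in Python (not SAS code, and not part of Enterprise Miner). The variable names and values are invented for illustration: each dictionary stands for one observation (a customer), and each key stands for a variable.

```python
# Hypothetical example data, invented for illustration only.
# Each dict is one observation (here, a customer); each key is a variable.
observations = [
    {"customer_id": 1, "age": 34, "region": "East", "total_purchases": 1250.00},
    {"customer_id": 2, "age": 51, "region": "West", "total_purchases": 310.50},
    {"customer_id": 3, "age": 27, "region": "East", "total_purchases": 89.99},
]

# The variables are the keys shared by every observation.
variables = list(observations[0].keys())

n_observations = len(observations)
n_variables = len(variables)

print(f"{n_observations} observations, {n_variables} variables: {variables}")
```

A real data set would be stored as a SAS data set and would hold far more observations; the layout of rows (observations) and columns (variables) is the same.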
When talking about types of data, consider the measurement level of each variable. You can generally classify each variable as one of the following:

• interval: a variable for which the mean (or average) makes sense, such as average income or average temperature.
• categorical: a variable consisting of a set of levels, such as gender (male or female) or drink size (small, regular, large). In general, if the variable is not continuous (that is, if taking the average does not make sense, such as average gender), then it is categorical.

Categorical data can be grouped in several ways. For the purposes of Enterprise Miner, consider these subgroupings of categorical variables:

• unary: a variable that has the same value for every observation in the data set.
• binary: a variable that has only two possible levels. Gender is an example.
• nominal: a variable that has more than two levels, but the values of the levels have no implied order. Pie flavors such as cherry, apple, and peach are examples.
• ordinal: a variable that has more than two levels, and the values of the levels have an implied order. Drink sizes such as small, regular, and large are examples.

Note: Ordinal variables may be treated as nominal variables, if you are not interested in the ordering of the levels. However, nominal variables cannot be treated as ordinal variables since there is no implied ordering by definition.

Missing values are not included in the counts.

To obtain a meaningful analysis, you must construct an appropriate data set and specify the correct measurement level for each of the variables.

Predictive and Descriptive Techniques

Predictive modeling techniques enable you to identify whether a set of input variables is useful in predicting some outcome variable.
For example, a financial institution may try to determine if knowledge of an applicant’s income and credit history (input variables) helps to predict whether the client is likely to default on a loan (outcome variable).

To distinguish the input variables from the outcome variables, set the model role for each variable in the data set. Identify outcome variables by using the target model role, and identify input variables by using the input model role. Examples of model roles include cost, freq, ID, and input. If you want to exclude some of the variables from the analysis, identify these variables by using the rejected model role. Specify a variable as an ID variable by using the ID model role.

Predictive modeling techniques require one or more outcome variables of interest. Each technique attempts to predict the outcome as well as possible according to some criteria such as maximizing accuracy or maximizing profit. This document shows you how to use several predictive modeling techniques through Enterprise Miner including regression models, decision trees, and neural networks. Each of these techniques enables you to predict a binary, nominal, ordinal, or continuous outcome variable from any combination of input variables.

Descriptive techniques enable you to identify underlying patterns in a data set. These techniques do not have a specific outcome variable of interest. This document explores how to use Enterprise Miner to perform the following descriptive analyses:

• Cluster analysis: This analysis attempts to find natural groupings of observations in the data, based on a set of input variables. After grouping the observations into clusters, you can use the input variables to try to characterize each group. When the clusters have been identified and interpreted, you can decide whether to treat each cluster independently.
• Association analysis: This analysis identifies groupings of products or services that tend to be purchased at the same time or at different times by the same customer. The analysis answers questions such as
  • What proportion of the people who purchased eggs and milk also purchased bread?
  • What proportion of the people who have a car loan with some financial institution later obtain a home mortgage from the same institution?

Overview of SEMMA

Enterprise Miner nodes are arranged into the following categories according to the SAS process for data mining, SEMMA:

• Sample: identify input data sets (identify input data; sample from a larger data set; partition data set into training, validation, and test data sets).
• Explore: explore data sets statistically and graphically (plot the data, obtain descriptive statistics, identify important variables, perform association analysis).
• Modify: prepare the data for analysis (create additional variables or transform existing variables for analysis, identify outliers, replace missing values, modify the way in which variables are used for the analysis, perform cluster analysis, analyze data with self-organizing maps (known as SOMs) or Kohonen networks).
• Model: fit a predictive model (model a target variable by using a regression model, a decision tree, a neural network, or a user-defined model).