
Building and Evaluating Predictive Models
Assignment 1
(BUS5PA Predictive Analytics – Semester 2, 2017)
By
<Student Name>
(18752031)
La Trobe Business School
Australia

Table of Contents
1. Setting up the project and exploratory analysis
2. Decision tree based modeling and analysis
3. Regression based modeling and analysis
4. Open ended discussion
5. Extending current knowledge with additional reading
References
Appendix

List of Figures
Fig. 1 Creation of project: BUS5PA_Assignment1_18752031
Fig. 2 Creation of Library
Fig. 3 Organics data source
Fig. 4 Roles of variables
Fig. 5 Distribution of Organics purchase indicator
Fig. 6 Organics data source in Organics diagram workspace
Fig. 7 Addition of Data partition
Fig. 8 Data set Allocations
Fig. 9 Data set Allocations
Fig. 10 Interactive method has not been selected
Fig. 11 Use average square error as Assessment measure
Fig. 12 Subtree Assessment Plot
Fig. 13 Decision Tree Model
Fig. 14 Decision Tree after adding Tree 2
Fig. 15 Three-way Split
Fig. 16 Assessment Measure for Decision Tree 2
Fig. 17 Average square error for the model with Tree 2
Fig. 18 StatExplore tool with ORGANICS data source
Fig. 19 Default input method of class and interval variables
Fig. 20 Imputation indicators for all imputed inputs
Fig. 21 Addition of Regression node
Fig. 22 Model Selection
Fig. 23 Regression Result
Fig. 24 Summary of Stepwise Selection
Fig. 25 Odds Ratio Estimates
Fig. 26 Average squared error (ASE)
Fig. 27 Model Comparison Process
Fig. 28 Model Comparison Result
Fig. 29 ROC Chart
Fig. 30 Cumulative Lift
Fig. 31 Fit Statistics
List of Tables
Table 1 Model performance comparison

1. Setting up the project and exploratory analysis
a) A new project named BUS5PA_Assignment1_18752031 has been created, as shown in Fig. 1.
a.1) A SAS library named Project has been created, and a data source has been created from the SAS dataset ORGANICS, as shown in Fig. 2 and Fig. 3.
a.2) As specified in the business case, roles have been set for the analysis variables; all roles defined for the ORGANICS data source are shown in Fig. 4.
a.3) TargetBuy has been defined as the target variable. In percentage terms, 24.77% of individuals have purchased organic products and the remaining 75.23% have not. The percentage distribution is shown in Fig. 5.
a.4) DemCluster has been set to Rejected, as shown in Fig. 4.
a.5) The data source named ORGANICS is defined in Fig. 3.
a.6) Fig. 6 shows that the ORGANICS data source has been added to the Organics diagram workspace.
b) TargetAmt cannot be used as a predictor of TargetBuy. TargetBuy indicates whether an individual has purchased organic products, whereas TargetAmt records the number of organic products bought. TargetAmt is only recorded when TargetBuy is Yes, i.e. for those who have purchased organic products, so using it as an input would leak the outcome into the model. The supermarket's objective is to develop a loyalty model by understanding whether customers have purchased any of the organic products, so TargetBuy is the appropriate target variable.
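As an aside, this leakage argument is easy to verify outside SAS. The following Python sketch (a hedged pandas analogue, assuming a hypothetical flat-file export named organics.csv of the ORGANICS data source) reproduces the target distribution and checks that TargetAmt is never positive for non-buyers:

    import pandas as pd

    # Hypothetical flat-file export of the ORGANICS data source.
    organics = pd.read_csv("organics.csv")

    # Target distribution: roughly 24.77% buyers (TargetBuy = 1) vs 75.23% non-buyers.
    print(organics["TargetBuy"].value_counts(normalize=True))

    # Leakage check: TargetAmt should never be positive when TargetBuy is 0.
    non_buyers = organics.loc[organics["TargetBuy"] == 0, "TargetAmt"]
    print((non_buyers.fillna(0) > 0).sum())  # expected: 0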

2. Decision tree based modeling and analysis
a) From the Sample tab, a Data Partition node has been added to the diagram and connected to the ORGANICS data source node. As specified in the assignment, 50% of the data has been allocated to training and 50% to validation (Fig. 7 and Fig. 8). A sketch of this step appears below.
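The same 50/50 partition can be sketched with scikit-learn (an illustrative analogue of the Data Partition node, not the SAS implementation, reusing the organics DataFrame from the earlier sketch):

    from sklearn.model_selection import train_test_split

    X = organics.drop(columns=["TargetBuy", "TargetAmt"])
    y = organics["TargetBuy"]

    # 50% training / 50% validation; stratifying on y keeps the ~24.77%
    # buyer rate the same in both partitions, similar to stratified
    # partitioning on a class target in the Data Partition node.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=1)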
b) Fig. 9 shows that the Decision Tree node has been added to the workspace and connected to the Data Partition node.
c) The decision tree model has been trained autonomously, with average squared error chosen as the subtree assessment measure, as depicted in Fig. 10 and Fig. 11.
c.1) Using the average squared error method, the optimal tree has 29 leaves, as shown in Fig. 12.
c.2) The first split uses the age variable and divides the training data into two subsets: in the subset with age less than 44.5, TargetBuy = 1 occurs at a higher than average concentration, while in the subset with age greater than or equal to 44.5, TargetBuy = 0 occurs at a higher than average concentration. The resulting decision tree model, assessed by average squared error, is shown in Fig. 13; a rough code analogue of this subtree selection follows.
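A rough scikit-learn analogue of this step: cost-complexity pruning generates candidate subtrees of a fully grown tree, and the subtree with the lowest validation average squared error is kept, mirroring the subtree assessment plot. This assumes X_train and X_valid from the partition sketch have been numerically encoded (e.g. one-hot), since scikit-learn trees accept only numeric, binary splits:

    from sklearn.tree import DecisionTreeClassifier

    # Candidate pruning strengths for subtrees of the fully grown tree.
    path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
        X_train, y_train)

    best_tree, best_ase = None, float("inf")
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1)
        tree.fit(X_train, y_train)
        p = tree.predict_proba(X_valid)[:, 1]   # P(TargetBuy = 1)
        ase = ((y_valid - p) ** 2).mean()       # validation average squared error
        if ase < best_ase:
            best_tree, best_ase = tree, ase

    print(best_tree.get_n_leaves(), best_ase)
    # The root split can be inspected; in this analysis it is age at ~44.5:
    print(X_train.columns[best_tree.tree_.feature[0]],
          best_tree.tree_.threshold[0])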
d) A second Decision Tree node has been added to the diagram and connected to the Data Partition node, as depicted in Fig. 14.
d.1) In the Properties panel of the new Decision Tree node, the maximum number of branches has been set to 3 to allow three-way splits, as shown in Fig. 15.
d.2) The second decision tree model has also been created using average squared error as the assessment measure, as depicted in Fig. 16.
d.3) According to average squared error, the optimal tree has 33 leaves; the subtree assessment plot is shown in Fig. 17. In (c), the optimal tree had 29 leaves. The misclassification rate of the model with Tree 2 (Train: 0.1848) is marginally lower than that of the model with Tree 1 (Train: 0.1851), while the average squared error of the model with Tree 1 (Train: 0.1329) is marginally lower than that of the model with Tree 2 (Train: 0.1330). Hence, in terms of average squared error the 29-leaf tree performs marginally better, and in terms of misclassification rate the 33-leaf tree performs marginally better. However, complexity increases with the number of leaves, and a tree with fewer leaves is less complex and more reliable.
e) Based on average squared error, the decision tree model with the smallest average squared error between the actual and predicted classes, i.e. Tree 1, appears better than the model with Tree 2, since ASE(Tree 1) < ASE(Tree 2), i.e. 0.1329 < 0.1330. A lower average squared error indicates the model performs better as a predictor because it is "wrong" less often.
3. Regression based modeling and analysis
a) In the Organics diagram, the StatExplore tool has been added to the ORGANICS data source and run, as shown in Fig. 18.
b) Missing value imputation is needed for regression because regression models in SAS Enterprise Miner ignore observations that contain missing values, which reduces the size of the training data. Less training data can strikingly weaken the predictive power of these models. Imputation is therefore used to overcome the obstruction of missing data: missing values must be imputed before the models are fitted. Imputing missing values is also required before fitting a model that discards observations with missing values when that model is to be compared with a decision tree, so that both models are assessed on the same observations.
c) For class variables, the Default Input Method has been set to Default Constant Value with the Default Character Value set to U, and for interval variables the Default Input Method is Mean (shown in Fig. 19). To create imputation indicators for all imputed inputs, the Indicator Variable Type has been set to Unique and the Indicator Variable Role has been set to Input (shown in Fig. 20); a code sketch of this scheme follows.
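A minimal Python sketch of the same imputation scheme, assuming the column lists below match the ORGANICS class and interval inputs (the names other than DemAffl, DemAge and DemGender are assumptions):

    from sklearn.impute import SimpleImputer

    class_cols = ["DemGender", "DemClusterGroup", "DemReg", "DemTVReg"]  # assumed
    interval_cols = ["DemAffl", "DemAge"]

    # Class variables: missing values replaced with the constant "U";
    # interval variables: replaced with the training mean. add_indicator=True
    # appends a 0/1 missingness flag per imputed column, playing the role of
    # the M_ imputation indicators (Indicator Variable Role = Input).
    class_imp = SimpleImputer(strategy="constant", fill_value="U",
                              add_indicator=True)
    interval_imp = SimpleImputer(strategy="mean", add_indicator=True)

    X_class = class_imp.fit_transform(X_train[class_cols])
    X_interval = interval_imp.fit_transform(X_train[interval_cols])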
d) A Regression node has been added to the diagram and connected to the Impute node, as depicted in Fig. 21.
e) The Selection Model has been set to Stepwise, and the Selection Criterion has been set to use Validation Error, as shown in Fig. 22; a simplified code stand-in follows.
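Stepwise selection scored on validation error is not built into scikit-learn, but a greedy forward pass (a simplified, assumed stand-in for the Stepwise/Validation Error setting, not the SAS algorithm) can be sketched as follows, assuming X_train and X_valid now hold the imputed, numerically encoded inputs (including the IMP_ and M_ columns):

    from sklearn.linear_model import LogisticRegression

    selected, remaining = [], list(X_train.columns)
    best_err = 1.0
    improved = True
    while improved:
        improved = False
        for col in remaining:
            cols = selected + [col]
            model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
            err = (model.predict(X_valid[cols]) != y_valid).mean()
            if err < best_err:          # track the best single addition
                best_err, best_col, improved = err, col, True
        if improved:
            selected.append(best_col)
            remaining.remove(best_col)

    print(selected, best_err)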
f) The result of the regression model is shown in Fig. 23. The selected model, based on the error rate for the validation data, is the model trained in Step 6, which consists of the following variables: IMP_DemAffl, IMP_DemAge, IMP_DemGender, M_DemAffl, M_DemAge and M_DemGender (i.e. affluence grade, age and gender) (Fig. 24). Hence, for the supermarket management, affluence grade, age and gender would be the main parameters for understanding consumer loyalty and formulating a predictive model. The odds ratio estimates show that the important parameters for this model are the imputed values of gender (Female and Male), affluence grade and age.
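Since odds ratio estimates are the exponentiated coefficients of the logistic model, they can be recovered in the sketch by refitting on the selected inputs (continuing the names from the selection sketch above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    final = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)

    # Odds ratio = exp(coefficient): values above 1 raise the odds of
    # TargetBuy = 1, values below 1 lower them (cf. Fig. 25).
    for name, coef in zip(selected, final.coef_[0]):
        print(f"{name}: odds ratio = {np.exp(coef):.3f}")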
The average squared error (ASE) of prediction is used to estimate the prediction error of a model fitted on the training data, as shown in Equation 1 (Atkinson, 1980):
$\mathrm{ASE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$    (Equation 1)
Here $y_i$ is the $i$-th observation in the validation data set, $\hat{y}_i$ is its predicted value under the fitted model, and $n$ is the validation sample size. The ASE output from SAS is shown in Fig. 26. The ASE for this model is 0.138587 on the training data and 0.137156 on the validation data. In a modeling context, a good predictive model produces ASE values that are close to each other across the data roles, as they are here. An overfit model produces a smaller ASE on the training data but higher values on the validation and test data, while an underfit model exhibits high values for all data roles.
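Equation 1 can be applied directly to the fitted model's predicted probabilities to reproduce this train-versus-validation comparison (a sketch reusing the names from the previous steps):

    def ase(model, X, y):
        """Average squared error between the target and P(TargetBuy = 1)."""
        p = model.predict_proba(X)[:, 1]
        return ((y - p) ** 2).mean()

    # Values close to each other (cf. 0.138587 vs 0.137156) suggest neither
    # overfitting (train much lower) nor underfitting (both high).
    print(ase(final, X_train[selected], y_train))
    print(ase(final, X_valid[selected], y_valid))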

4. Open ended discussion
a) Three models, i.e. Decision Tree 1, Decision Tree 2 and the regression model, have been compared using the Model Comparison node; Fig. 27 and Fig. 28 show the comparison process and result respectively. Tree 1 has been selected in the fit statistics.
Table 1 Model performance comparison

    Statistics Label                         Tree1    Reg      Tree2
    Valid: Average Squared Error             0.133    0.137    0.133
    Valid: ROC Index                         0.824    0.807    0.824
    Valid: Misclassification Rate            0.185    0.188    0.188
    Valid: Kolmogorov-Smirnov Statistic      0.496    0.462    0.496
As per the fit statistics (Fig. 31), Tree 1 is the selected model; the validation misclassification rate is lowest for Tree 1, i.e. 0.185. From Table 1 it can be noted that the Kolmogorov-Smirnov statistic and the ROC index (area under the curve) are effectively the same for Tree 1 and Tree 2, and both perform slightly better than the regression model. In terms of average squared error, Tree 1 and Tree 2 are likewise effectively the same and perform better than the regression model. Hence, it can be concluded that Tree 1 is the best performer of the three models; the sketch below shows how these statistics can be recomputed.
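The validation statistics in Table 1 can be recomputed as follows. This is an illustrative sketch only: tree1 and tree2 stand for the two fitted decision trees (e.g. best_tree from the pruning sketch and its three-way counterpart, which scikit-learn cannot build directly), and final and selected come from the regression sketches:

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.metrics import roc_auc_score

    yv = y_valid.to_numpy()
    models = {"Tree1": (tree1, list(X_valid.columns)),  # assumed fitted earlier
              "Tree2": (tree2, list(X_valid.columns)),
              "Reg":   (final, selected)}

    for label, (model, cols) in models.items():
        p = model.predict_proba(X_valid[cols])[:, 1]
        print(label,
              round(np.mean((yv - p) ** 2), 3),                      # ASE
              round(roc_auc_score(yv, p), 3),                        # ROC index
              round(np.mean((p >= 0.5) != yv), 3),                   # misclassification
              round(ks_2samp(p[yv == 1], p[yv == 0]).statistic, 3))  # KS statistic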
b) After analyzing the three models, Decision Tree 1 is the best performer for this business case. The decision tree method is a popular data mining technique (Hastie, Tibshirani, & Friedman, 2009): it is easy to use, robust in the presence of missing data, and offers good interpretability. Decision trees are generally flexible, while regression models are
comparatively inflexible, for example when adding terms such as interaction or polynomial terms. Decision trees can also handle missing values without any imputation, whereas a regression model usually needs missing values imputed before the model is built. Decision trees are nonparametric and highly robust, while regression models are parametric and sensitive to influential points (Berry & Linoff, 1997). Hence, decision trees are also frequently used in the pre-processing phase for a logistic regression.
c) The advantages of decision trees include: implicit variable screening and selection, since the top nodes of the tree contain the most important variables in the dataset; less data preparation, since the data does not need to be normalized and trees are less sensitive to missing data and outliers; no assumption of linearity; and graphical output that is easy to explain, i.e. decisions based on cut points.
A regression model, in contrast, estimates relationships among variables; it identifies key patterns in large data sets and is often used to determine how the independent variables relate to the dependent variable and to explore the forms of those relationships.
High dimensionality increases the risk of overfitting due to correlations between redundant variables, increases computation time, raises the cost and reduces the practicality of data collection for the model, and makes interpretation of the model difficult. For the organics data, the misclassification rate is lowest for the decision tree, and the ROC chart shows that both the decision tree and regression models have good predictive accuracy. In this case, applying decision trees to consumer loyalty analysis will be valuable for predictive modeling of the consumer segments.
5. Extending current knowledge with additional reading
The supermarket's objective is to develop a loyalty model based on whether customers have purchased any of the organic products, so the model needs to fit the real world.

Just getting things wrong: The problem must be clearly identified; without a clear objective, the model will fail. In this business case there were two candidate target variables, TargetBuy and TargetAmt. TargetAmt is simply a by-product of TargetBuy and is not a binary variable. Selection of the target variable is therefore one of the most important steps, and the model could go wrong if TargetAmt were selected as the target variable.
Overfitting: As a model becomes more complex, with more leaves in a decision tree or more training iterations for a neural network, it appears to fit the training data better; in reality it fits noise as well as signal. In this business case, the misclassification rate of the model with Tree 2 (Train: 0.1848) is marginally lower than that of the model with Tree 1 (Train: 0.1851), while the average squared error of the model with Tree 1 (Train: 0.1329) is lower than that of the model with Tree 2 (Train: 0.1330). Hence the 29-leaf tree performs marginally better in terms of average squared error and the 33-leaf tree marginally better in terms of misclassification rate. But since complexity increases with the number of leaves, the less complex and more reliable tree, i.e. Tree 1, is more suitable for the model.
Sample bias: The sample for this analysis covers 5 geographical regions and 13 television regions. It therefore includes different sets of consumers, so sample bias should not be a major concern in the data.
Future not being like the past: In this business case, the model has been created using past data on the supermarket's consumers. It will not always be true that a consumer who purchased the product will buy it again in the future; various extraneous factors may affect loyalty and consumers' purchases.

References
1. Berry, M. J., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons.
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). Overview of supervised learning. In The elements of statistical learning (pp. 9-41). Springer New York.