Data Analysis for Decision Making Assignment - University Name

Added on  2023/05/26

DATA ANALYSIS FOR DECISION MAKING 1
Data Analysis for Decision Making
Student Name:
Professor:
Institution Affiliation:
Date:
1. GENERAL QUESTIONS
1) Data can be described, among other things, in terms of central tendency and spread.
a) Name at least two common measures used for capturing central tendency.
Mean
Median
Mode
b) Give a one sentence definition of at least two of the measures named in a).
Mean. The mean is the average of the numbers: it is obtained by adding up all the values in a data set and dividing the sum by the number of values in the set.
Median. When the data in a set are arranged in ascending order, the median is the middle value separating the lower half from the upper half; it is simply the middle value.
Mode. The mode is the value that occurs most frequently in the data set.
c) Name at least two common measures used for capturing spread.
Range
Quartiles and Interquartile Range
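The measures above can be computed directly with Python's standard statistics module; the small data set below is purely illustrative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample

# central tendency
mean = statistics.mean(data)      # average of the values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

# spread
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                                 # interquartile range
```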
d) A boxplot can be used to visualize the distribution of an attribute. Explain how to interpret a
boxplot.
A boxplot enables us to examine the distribution of a data set at a glance and to compare the level of the scores across groups.
In the first step, the scores are sorted in ascending order. Next, the sorted data are divided into four equal parts (25% of the scores in each subgroup). These four subdivisions of the data are referred to as quartiles and are labeled 1 to 4, starting at the lowest.
If a box plot is comparatively short, the data set being analyzed shows strong agreement and is concentrated around the mean (Thearling, 2017). In the case of students' exam performance, the students would have scored grades within the same range.
If a box plot is comparatively tall, the data set under analysis shows great variation. In the case of students' exam performance, some may have scored high grades while others scored low grades, so the separation between the two groups is very wide.
If one box plot sits lower or higher than another, there is a difference between the groups. For instance, if the box plot for women is lower or higher than the one for men in an election analysis, this could mean that more men participated in the election than women, and vice versa for the higher box plot.
If the box plots are unequal in size, similar views are represented in the part of the scale covered by the wider plot, while more variable opinions are held in the parts of the scale that are narrower (Shmueli, Bruce, Yahav, Patel, & Lichtendahl, 2017). The whiskers can also be used to interpret a box plot: a longer lower whisker means that the students' performance is concentrated toward the lower quartile, and vice versa for a longer upper whisker.
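The quartile-based quantities a boxplot displays can also be computed numerically; the exam-score data below and the conventional 1.5 × IQR whisker rule are illustrative assumptions, not values taken from the text.

```python
import statistics

scores = [35, 42, 48, 51, 55, 58, 62, 66, 71, 120]  # illustrative exam scores

q1, q2, q3 = statistics.quantiles(scores, n=4)  # the box: Q1, median, Q3
iqr = q3 - q1                                   # height of the box

# conventional whisker limits: 1.5 * IQR beyond the quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# points beyond the whisker limits are drawn individually as outliers
outliers = [s for s in scores if s < lower_limit or s > upper_limit]
```

A small IQR corresponds to a short box (scores in close agreement), while a large IQR corresponds to a tall box (great variation), matching the interpretations above.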
2. CLASSIFICATION AND REGRESSION
a) How would you typically code the output (target) in binary (i.e., two-class) and multi-class
classification, respectively? Give the actual coding, what the output layer of the network would
look like, and how you would interpret the outputs, i.e., how the predicted class is determined.
Consider the following training data whose goal is to determine whether a car is manual or
automatic.
In binary classification the target is coded with a single output node: manual is encoded as 1 and automatic as 0. The output layer consists of one node (typically with a sigmoid activation), and the predicted class is determined by whether the output lies above or below 0.5.

Input          Hidden   Output
size     1.0   0.469
model    2.0   0.523    0.6
engine   3.0   0.572
               0.617

The output is 0.6, and because this value is closer to 1 than to 0, the neural network predicts that the car is manual.
In multi-class classification the target is one-hot coded, with one output node per class, and the predicted class is the one whose output node has the largest value.

Input          Hidden   Output
size     1.0   0.469    0.4
model    2.0   0.523    0.51
engine   3.0   0.572
               0.617

In this example manual is encoded as (1, 0) and automatic as (0, 1). In the output, the larger of the two node values is in the second position and maps to (0, 1), hence the neural network predicts that the car is automatic.
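The decision rules described above (thresholding a single output node, and taking the largest of several output nodes) can be sketched as follows; the values 0.6 and (0.4, 0.51) mirror the examples above, and the function names are illustrative.

```python
def predict_binary(output, threshold=0.5):
    # single output node: manual is coded 1, automatic 0
    return "manual" if output >= threshold else "automatic"

def predict_multiclass(outputs, labels):
    # one output node per class: pick the class with the largest value
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]

print(predict_binary(0.6))                                       # manual
print(predict_multiclass([0.4, 0.51], ["manual", "automatic"]))  # automatic
```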
DATA ANALYSIS FOR DECISION MAKING 5
b) If we have a categorical but ordered input attribute, let’s say with the possible values {Low,
Medium, High}, how would you code that? Why is this a good coding for that attribute?
I would encode Low as 1, Medium as 2 and High as 3.
This is an appropriate coding because the integer values preserve the natural order of the attribute, with a larger code indicating a higher level.
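One way to implement such an ordinal coding is a simple lookup table; the ascending mapping below (Low = 1, Medium = 2, High = 3) is an illustrative choice.

```python
# integer codes that preserve the natural ordering of the attribute
ordinal_map = {"Low": 1, "Medium": 2, "High": 3}

values = ["High", "Low", "Medium", "High"]
encoded = [ordinal_map[v] for v in values]
print(encoded)  # [3, 1, 2, 3]
```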
3. CLUSTERING
A one-dimensional dataset with ten instances is given below:
{1, 1, 2, 3, 5, 8, 13, 21, 33, 54}
a) Assume that you have to explore a large dataset of high dimensionality and that you
know nothing about the distribution of the data. Describe a method for finding the
number of clusters in the dataset using k-means, and furthermore, explain how k-means
can be applied to find the dimensions in which the clusters separate (i.e. how can you
eliminate dimensions that don't provide any useful information for clustering the dataset,
using k-means).
The gap statistic method can be used to determine the number of clusters in the dataset. The method compares the change in within-cluster dispersion produced by running the k-means algorithm for different numbers of clusters with the change expected under a null reference distribution of the data. According to Thearling (2017), this method gives more precise results than the other methods. The procedure is given below.
Cluster the observed data, varying the number of clusters from k = 1, ..., k_max, and compute the corresponding within-cluster dispersions W_k.
Generate B reference data sets (for example, sampled uniformly over the range of the observed data) and cluster each of them with a varying number of clusters k = 1, ..., k_max, giving dispersions W*_kb. Compute the estimated gap statistic Gap(k) = (1/B) * sum_b log(W*_kb) - log(W_k).
With w-bar = (1/B) * sum_b log(W*_kb), compute the standard deviation sd_k = sqrt((1/B) * sum_b (log(W*_kb) - w-bar)^2) and define s_k = sd_k * sqrt(1 + 1/B).
Choose the number of clusters as the smallest k such that Gap(k) >= Gap(k+1) - s_(k+1).
To find the dimensions in which the clusters separate, k-means can then be run on the full data and the resulting cluster centroids compared dimension by dimension: dimensions in which the centroids are nearly identical contribute little to the separation and can be eliminated.
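The steps above can be sketched for the one-dimensional data set given earlier, assuming a simple Lloyd-style k-means and uniform reference distributions; all function names are illustrative, and this is a minimal sketch rather than a full implementation.

```python
import math
import random

def kmeans_1d(data, k, iters=50, seed=0):
    """Simple Lloyd's algorithm for one-dimensional data."""
    rng = random.Random(seed)
    centers = rng.sample(list(data), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # move each center to its cluster mean; keep it if the cluster emptied
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def log_dispersion(clusters):
    """log of the within-cluster sum of squared deviations, log(W_k)."""
    w = 0.0
    for c in clusters:
        if c:
            m = sum(c) / len(c)
            w += sum((x - m) ** 2 for x in c)
    return math.log(w + 1e-12)  # guard against log(0)

def gap_statistic_k(data, k_max=4, B=10, seed=0):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_(k+1)."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = log_dispersion(kmeans_1d(data, k))
        # dispersions of B uniform reference data sets
        ref = [log_dispersion(kmeans_1d(
                   [rng.uniform(lo, hi) for _ in data], k))
               for _ in range(B)]
        mean_ref = sum(ref) / B
        gaps.append(mean_ref - log_w)
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / B)
        s.append(sd * math.sqrt(1 + 1 / B))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return k_max

data = [1, 1, 2, 3, 5, 8, 13, 21, 33, 54]
print(gap_statistic_k(data))
```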
4. MARKETING
1) You work at an e-retailer selling primarily clothes. Now you would like to use data mining,
more specifically predictive modeling, to select which customers to target with a promotion for a
new line of luxury dresses.
a) Describe how you would model this task, specifically the kind of data set that you would use,
i.e., input variables (attributes) and the output variable (target).
I would break down the input variables into three major divisions:
(1) All data related to the new line of luxury dresses in terms of prices, quality, sizes, designs and colors.
(2) All variables related to the different media houses that my company will use for the planned advertising and promotion activity for the luxury dresses.
(3) All variables related to the budgetary allocation for the advertising and promotion activities, with respect to both external advertising costs and internal advertising investment.
I would use the following output variables:
The observed change in sales volume of the new luxury dresses per given period of time.
The change in revenue resulting from the promotion activity.
The volume of advertisements placed with the different media houses during the campaign period.
b) Give the main advantage (one per technique) of using the following two modeling
techniques for the task presented above: i) random forest and ii) decision trees. Are these
properties contradictory, i.e., must we choose one of them, or can we (at least to some
degree) have both?
Random forest
The main advantage of a random forest is its ability to limit overfitting without increasing errors due to bias and variance (Shmueli et al., 2017). A random forest makes use of many random features of the data and combines many decision trees instead of just one, which is more effective.
Decision tree
The main advantage of using a decision tree is that it is easy and fast to interpret, making the visualization process quicker and easier (Roiger, 2017).
These properties are not contradictory: decision trees and random forests can be used together to some degree, depending on the data, because a random forest is a combination of decision trees. When a
decision tree gets very deep, overfitting can occur, requiring a random forest to eliminate the shortcoming.
Question 5
a) How many instances belong to class3? (1E)
5002
b) How many instances have been predicted as class4? (1E)
3265+2=3267
c) What is the precision for class6? (1E)
precision = TP / (TP + FP) = 6 / (6 + 7) = 6/13 ≈ 0.462

d) What is the recall for class6? (1E)

recall = TP / (TP + FN) = 6 / (6 + 48 + 8) = 6/62 ≈ 0.0968
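The two formulas can be checked with a short script; the counts 6, 7 and 48 + 8 are the class6 figures used in the calculations above.

```python
def precision(tp, fp):
    # fraction of predicted positives that are actually positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that the model finds
    return tp / (tp + fn)

print(round(precision(6, 7), 3))    # 0.462
print(round(recall(6, 48 + 8), 4))  # 0.0968
```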
e) The result dialog below is from another model predicting the same data. Give an example
when the model producing the results below would be preferable over the model producing the
results above. (1C)
This model provides lower accuracy than the previous model. However, it may be preferred when there is a large class imbalance, because it then has better predictive power on the rarer classes.
6. For each figure (a, b, c), discuss underfitting/overfitting and generalization, and what steps
you would take to remedy the situation, i.e. do you need to collect more training data, adjust the
structure of the network, regularize the model or maybe just train the model for an additional
number of epochs? Motivate your decision for each scenario (a, b, c).
Document Page
DATA ANALYSIS FOR DECISION MAKING 9
Figure (c) could represent appropriate generalization of the data. In machine learning, generalization refers to how accurately the concepts a model has learned from its training data apply to new data that was not seen while the machine was learning. A well-generalized model should accurately represent the training data and should be flexible enough to accommodate new data efficiently (KS & Kamath, 2017).
Figure (a) could represent underfitting. According to Lu, Setiono, & Liu (2017), an underfitted model cannot appropriately represent the training data and cannot apply the concept of generalization to new data. To solve the problem of underfitting, we can fit the target variable with an nth-degree polynomial, yielding a more general model. As we increase the polynomial degree, the training error tends to decrease; the cross-validation error also decreases at first, forming a convex curve, so the model becomes more accurate than the underfitted one.
Figure (b) could represent overfitting. Overfitting occurs when a model represents the training data so closely that it adversely influences generalization when new data is added to the model (Ashraf, Ahmad, & Ashraf, 2018). To minimize overfitting, the flexibility of the model should be constrained using any of the following approaches:
a) Lasso regularization: a penalty term 'P' (used to reduce overfitting) is added to the existing cost function.
b) Ridge regularization: the regularization term 'P' is added to the existing cost function to reduce the effect of overfitting and allow generalization to new data.
Document Page
DATA ANALYSIS FOR DECISION MAKING 10
c) Elastic net: the two methods above (a & b) are combined in the existing model to reduce the effect of overfitting and allow generalization to new data.
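As a minimal sketch of how an L2 (ridge-style) penalty shrinks a coefficient, the toy gradient-descent fit below penalizes the slope; the data and hyperparameters are illustrative assumptions, not part of the assignment.

```python
def ridge_fit(xs, ys, lam=0.0, lr=0.01, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error,
    with an L2 penalty of strength lam on the slope w."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n + lam * w
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w_plain, b_plain = ridge_fit(xs, ys, lam=0.0)  # recovers roughly w = 2, b = 1
w_ridge, b_ridge = ridge_fit(xs, ys, lam=1.0)  # penalized slope shrinks toward 0
```

Lasso differs only in penalizing |w| instead of w squared, and elastic net combines the two penalties, as described in (a), (b) and (c) above.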
REFERENCES
Ashraf, N., Ahmad, W., & Ashraf, R. (2018). A Comparative Study of Data Mining Algorithms
for High Detection Rate in Intrusion Detection System.
KS, D., & Kamath, A. (2017). Survey on Techniques of Data Mining and its Applications.
Lu, H., Setiono, R., & Liu, H. (2017). Neurorule: A connectionist approach to data mining. arXiv
preprint arXiv:1701.01358.
Roiger, R. J. (2017). Data mining: a tutorial-based primer. Chapman and Hall/CRC.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining
for business analytics: concepts, techniques, and applications in R. John Wiley & Sons.
Thearling, K. (2017). An introduction to data mining.