Business Analytics: Classification, Regression, and Data Analysis

Added on 2021/02/21

AI Summary

This business analytics assignment solution delves into various aspects of data analysis and modeling. It begins with an exploration of classification methods, including the confusion matrix and practical applications of decision trees and K-Nearest Neighbor techniques. The document then examines oversampling partitioning strategies and the use of explanatory and categorical variables in logistic regression. Quantitative questions are addressed, focusing on KNN model predictions, building predictive models for repair time, analyzing datasets, and providing recommendations for data improvements. The solution also covers techniques for handling missing values in Excel, calculating protein levels by blood type, and presenting data visualizations. Furthermore, the assignment tackles investment analysis, including calculating returns and determining investment percentages for financial goals. Finally, linear programming is explored, with a focus on model writing, result interpretation, and incorporating discounting strategies.

Business Analytics

TABLE OF CONTENTS
INTRODUCTION
SECTION A: DISCUSSION QUESTIONS
1. Explaining Confusion Matrix in the Classification Methods along with example
2. Defining two practical examples on applications of classification methods with explanations
3. Over Sampling Partitioning before building the model
4. Explaining how explanatory and categorical variables can be used in logistic regression
SECTION B: QUANTITATIVE QUESTIONS
5.
a. Explaining steps of KNN model in relation to making predictions about customers who will spend more than $1000
b. Developing a predictive model for making assessment about new female customer
6.
a. Analysing data set and giving recommendations
b. Building a model to predict the repair time for a future booking service
c. Giving recommendations in relation to adding variables in dataset for better assessment
7.
a. Explaining approach in relation to filling missing values in excel sheet
b. Calculating average total protein for each blood type
c. Computing range of total protein and explaining approach for the same
d. Presenting the extent to which protein is declined by age
e. Presenting two best visualisation tool for which is highly relevant to the current data set
a.
b.
8.
a. Assessment of annual investments and returns
b. If Matthew aims to gain $1,500,000 at the end of the 30th year, what percentage of his salary he should put in the investment annually
9. Linear programming assessment
a. Writing linear optimization model for company in order to make best decision
b. Presenting and interpreting results
c. Rewriting model when discounting strategy is applied
REFERENCES

INTRODUCTION
Business analytics is the methodical exploration of a company's data with a focus on statistical analysis. Organisations use it
to make data-driven decisions and to gain insights that support suitable business decisions. It is especially useful for
optimising and automating business processes. It is broadly divided into two segments: business intelligence and statistical
analysis. The present study covers various aspects of business analytics. It also works through quantitative questions and
describes several models.
SECTION A: DISCUSSION QUESTIONS
1. Explaining Confusion Matrix in the Classification Methods along with example.
A confusion matrix, also known as an error matrix, is used in machine learning for problems of statistical classification. It is
a specific table layout that allows the performance of an algorithm to be visualised. Each row of the matrix represents the
instances in a predicted class, whereas each column represents the instances in an actual class, or vice versa (Salamon and
Bello, 2017).
It is a special type of contingency table with two dimensions, 'actual' and 'predicted', and an identical set of classes in
both dimensions. Each combination of dimension and class is a variable in the contingency table.
For example: for a sample of 27 animals (8 cats, 6 dogs and 13 rabbits), the confusion matrix is as follows:
                  Actual Class
Predicted Class   Cat   Dog   Rabbit
Cat                 5     2        0
Dog                 3     3        2
Rabbit              0     1       11
In this confusion matrix, of the 8 actual cats, the system predicted that 3 were dogs. Of the 6 actual dogs, it predicted
that 1 was a rabbit and 2 were cats. The matrix shows that the system has trouble distinguishing between dogs and cats,
whereas it distinguishes rabbits from the other animals fairly easily.
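The reading of the matrix above can be checked with a short calculation. The following is a minimal Python sketch that uses only the counts from the example; overall accuracy, per-class recall and per-class precision are standard summary measures that the original text does not itself report.

```python
# Rows are predicted classes, columns are actual classes (as in the example).
classes = ["Cat", "Dog", "Rabbit"]
matrix = [
    [5, 2, 0],   # predicted Cat
    [3, 3, 2],   # predicted Dog
    [0, 1, 11],  # predicted Rabbit
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall: correct predictions divided by the actual count (column sum).
recall = {
    name: matrix[i][i] / sum(row[i] for row in matrix)
    for i, name in enumerate(classes)
}
# Precision: correct predictions divided by the predicted count (row sum).
precision = {
    name: matrix[i][i] / sum(matrix[i])
    for i, name in enumerate(classes)
}

print(f"accuracy = {accuracy:.3f}")
for name in classes:
    print(f"{name}: recall = {recall[name]:.3f}, precision = {precision[name]:.3f}")
```

The low recall and precision for dogs, compared with rabbits, is the numerical form of the observation that dogs and cats are confused with each other.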
2. Defining two practical examples on applications of classification methods with explanations.
Classification is a data mining process used to predict the group membership of data instances. Its main role is to assign
categories to a collection of data so that analysis and prediction become more accurate. Two applications of classification
methods are as follows:
Decision Tree Technique – This technique produces a sequence of rules from the data attributes and their classes, and these
rules are then used to classify data. It is easy to understand, interpret and visualise, and it needs little data preparation.
This method can handle both numerical and categorical data (Deng and et.al., 2016).
K – Nearest Neighbor Technique – In this technique, the value of k defines how many nearest neighbors are examined in order to
decide the class of a sample data point. It is used for microarray data classification, short-term traffic flow forecasting,
agarwood oil quality grading, face recognition, etc. It is simple to implement and is regarded as effective with noisy and
large training data.
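The role of k in the nearest neighbor technique can be illustrated with a minimal Python sketch. The training points, feature names and labels below are hypothetical and are not taken from the assignment data; Euclidean distance and simple majority voting are assumptions chosen for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # train is a list of (features, label) pairs; distance is Euclidean.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (visits per year, average basket in $) -> spend class.
train = [
    ((2, 40), "low"), ((3, 55), "low"), ((4, 60), "low"),
    ((10, 150), "high"), ((12, 170), "high"), ((11, 160), "high"),
]

print(knn_predict(train, (9, 140), k=3))
print(knn_predict(train, (3, 50), k=3))
```

In practice the features would usually be scaled before distances are computed, since a feature with a large numeric range otherwise dominates the vote.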
3. Over Sampling Partitioning before building the model.
Oversampling is a statistical technique that assists in the analysis of data by adjusting the class distribution of a data set.
In other words, it changes the ratio between the different classes or categories represented.
The technique is used in statistical sampling, in survey design methodology and in machine learning. Oversampling deliberately
introduces a bias so that more samples are selected from one class than from another (Ebadi, Antignac and Sands, 2016). It is
used to compensate for an imbalance that is either already present in the data or likely to develop if a purely random sample
were taken. Oversampling techniques for classification problems are as follows:
Random oversampling – This technique supplements the training data with multiple copies of some of the minority class samples.
Oversampling can be done more than once (2x, 3x, 5x, etc.). Instead of duplicating every sample in the minority class, some of
them can be randomly chosen with replacement.
ADASYN – This stands for the adaptive synthetic sampling approach. It builds on the methodology of SMOTE by shifting the
importance of the classification boundary to those minority class examples that are difficult to learn (Slagter, Hsu and Chung,
2015). The technique uses a weighted distribution over the minority class examples according to their level of difficulty in
learning, so that more synthetic data is generated for minority class examples that are harder to learn.
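Random oversampling, the first of the two techniques above, can be sketched in a few lines of Python. The data, the function name `random_oversample` and the choice to balance every class up to the size of the largest class are assumptions made for illustration; ADASYN is not shown because it needs a neighbor search and synthetic sample generation.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples (drawn with replacement) until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels

# Hypothetical imbalanced data: 6 "no" cases and 2 "yes" cases.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = ["no"] * 6 + ["yes"] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling is normally applied to the training partition only, so that duplicated records do not leak into the validation or test data.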

4. Explaining how explanatory and categorical variables can be used in logistic regression
Logistic regression is an appropriate technique for describing data and for establishing the relationship between a dependent
variable and one or more independent variables. It is a form of predictive analysis in which the outcome is measured with a
dichotomous variable. It estimates the probability of an outcome on the basis of previous observations in the data set, and it
has become a vital tool in machine learning because it allows incoming data to be classified on the basis of historical data.
For example, it can be used to predict the chance of winning or losing in a future situation. As per the situation given, the
logistic regression model has two categorical explanatory variables, X1 and X2, where X1 has the categories high, low and
average, while X2 has the categories Sydney, Melbourne and Brisbane. These variables are used as independent variables in the
logistic regression model, with one category of each variable serving as the reference level. For example, take the categorical
variable with the levels low, high and average, with low as the reference level. If the coefficient for high is 1.3, then a
change in the variable from low to high raises the natural log of the odds of the event by 1.3.
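The use of categorical variables described above can be sketched in Python. The helper `dummy_code` and the choice of "low" and "Sydney" as reference levels are assumptions for illustration; only the coefficient of 1.3 comes from the text.

```python
import math

def dummy_code(value, levels, reference):
    """Return 0/1 indicators for every level except the reference level."""
    return {f"is_{lvl}": int(value == lvl) for lvl in levels if lvl != reference}

x1_levels = ["low", "average", "high"]
x2_levels = ["Sydney", "Melbourne", "Brisbane"]

# Choosing "low" and "Sydney" as reference levels is an assumption for this sketch.
row = {
    **dummy_code("high", x1_levels, "low"),
    **dummy_code("Brisbane", x2_levels, "Sydney"),
}
print(row)

# A coefficient of 1.3 on is_high adds 1.3 to the log-odds relative to "low",
# which multiplies the odds by exp(1.3).
coef_high = 1.3
odds_ratio = math.exp(coef_high)
print(f"odds ratio, high vs low = {odds_ratio:.2f}")
```

Each three-level variable contributes two indicator columns, so the model has four dummy variables in total plus the intercept.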

SECTION B: QUANTITATIVE QUESTIONS
5.
a. Explaining steps of KNN model in relation to making predictions about customers who will spend more than $1000
b. Developing a predictive model for making assessment about new female customer
6.
a. Analysing data set and giving recommendations
Row Labels    Average of Repair time (hours)   Average of Months since last service   Average of Time of service
James          9.03                            6.5                                    1.5
John           9.10                            6.17                                   1.5
Job           13.84                            4.82                                   1.38
Grand Total   11.20                            5.61                                   1.45
b. Building a model to predict the repair time for a future booking service
Repair time (hours)   Months since last service   Time of service   Type of repair   Repairperson
12.6                  6                           1                 2                John
9.6                   10                          1                 2                John
14                    7                           1                 2                John
7.6                   6                           1                 2                John
10.1                  3                           1                 2                John
10                    6                           1                 2                John
15.2                  9                           1                 2                John
13.7                  10                          1                 2                John
10                    9                           1                 2                John
3.7                   6                           1                 2                John
Average repair time in morning shift = 10.65 hours
Repair time (hours)   Months since last service   Time of service   Type of repair   Repairperson
8.3                   7                           2                 2                John
7.4                   1                           2                 2                John
10.1                  1                           2                 2                John
8.4                   3                           2                 2                John
4.1                   4                           2                 2                John
4.5                   7                           2                 2                John
8.5                   4                           2                 2                John
Mean repair time in afternoon shift = 7.33 hours
The above analysis shows that the afternoon shift is more suitable for John: his mean repair time is 7.33 hours in the
afternoon shift, compared with 10.65 hours in the morning shift. On this basis, John should be assigned to the afternoon shift
rather than the morning shift.
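The two shift averages can be reproduced from John's records. The following Python sketch copies the repair times and time-of-service codes from the two tables; reading code 1 as the morning shift and code 2 as the afternoon shift follows the text.

```python
from statistics import mean

# (repair time in hours, time-of-service code) for John, copied from the tables.
# Code 1 is read as the morning shift and code 2 as the afternoon shift.
john = [
    (12.6, 1), (9.6, 1), (14.0, 1), (7.6, 1), (10.1, 1),
    (10.0, 1), (15.2, 1), (13.7, 1), (10.0, 1), (3.7, 1),
    (8.3, 2), (7.4, 2), (10.1, 2), (8.4, 2), (4.1, 2), (4.5, 2), (8.5, 2),
]

shift_names = {1: "morning", 2: "afternoon"}
shift_means = {
    name: mean(t for t, code in john if code == c)
    for c, name in shift_names.items()
}

for name, value in shift_means.items():
    print(f"{name}: {value:.2f} hours")
```

With only 10 and 7 records per shift, the difference is indicative rather than conclusive.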
c. Giving recommendations in relation to adding variables in dataset for better assessment
In addition to the type of repair, the month in which the last service was provided also influences repair time. This factor
therefore needs to be considered when making the assessment.

7.
a. Explaining approach in relation to filling missing values in excel sheet
The steps to fill the missing values in an Excel sheet are as follows:
• In the first step, the rows and columns that contain the missing values have to be selected.
• In the next step, Ctrl+G has to be pressed to open the 'Go To' dialogue box. This box shows the 'Special' button, which has
to be clicked to locate the missing values.
• In the third step, 'Go To Special' can alternatively be reached from the 'Find & Select' option on the Home tab. After
clicking 'Go To Special', the 'Blanks' option is selected and the OK button is clicked.
• In the last step, the formula bar has to be clicked or the F2 key pressed. The value can then be entered, and it goes into
the active blank cell. If the same value is needed in all blank cells, Ctrl+Enter has to be pressed.
b. Calculating average total protein for each blood type
Labelled as   Blood type   Average protein level
1             O+           7.21
2             O-           7.19
3             A+           7.19
4             A-           7.26
5             B+           7.21
6             B-           7.43
7             AB+          7.27
8             AB-          7.07
Interpretation- The above analysis shows that the average protein level differs slightly between blood groups. O+ has an
average protein level of 7.21, while O- has 7.19. The averages for A+ and A- are 7.19 and 7.26, which shows a slight difference
between them. B+ and B- have averages of 7.21 and 7.43, and AB+ and AB- have averages of 7.27 and 7.07. From the above table,
B- is the blood group with the highest average protein level and AB- has the lowest average protein level in comparison with
the other blood groups.
c. Computing range of total protein and explaining approach for the same
Labelled as   Blood type   Max Protein level   Min Protein level   Range of Protein level (Max – Min)
1             O+           10.88               3.56                7.32
2             O-           10.93               4.46                6.47
3             A+           9.26                4.66                4.6
4             A-           9.16                4.79                4.37
5             B+           9.16                5.68                3.48
6             B-           10.62               5.31                5.31
7             AB+          9.93                5.46                4.47
8             AB-          8.98                4.57                4.41
Interpretation- The above table shows the range of the protein level, computed by subtracting the minimum protein level from
the maximum protein level. For the O blood group the range is 7.32 for O+ and 6.47 for O-. A+ and A- have maximum protein
levels of 9.26 and 9.16 and minimum levels of 4.66 and 4.79, so their ranges are 4.6 and 4.37. The ranges for B+ and B- are
3.48 and 5.31. Lastly, for AB+ and AB- the ranges are 4.47 and 4.41, again applying the formula of maximum less minimum. The
highest range, 7.32, belongs to the O+ blood group, which means that protein levels vary most widely within this group. The
lowest range, 3.48, belongs to the B+ blood group, so protein levels are most consistent within that group. The range measures
spread only, so it does not show that one blood group is better than another.
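The range column can be recomputed directly from the maximum and minimum values. This is a minimal Python sketch using the figures from the table; the variable names are illustrative.

```python
# (max, min) total protein per blood type, copied from the table.
protein = {
    "O+": (10.88, 3.56), "O-": (10.93, 4.46),
    "A+": (9.26, 4.66), "A-": (9.16, 4.79),
    "B+": (9.16, 5.68), "B-": (10.62, 5.31),
    "AB+": (9.93, 5.46), "AB-": (8.98, 4.57),
}

# Range = maximum minus minimum, rounded to two decimals as in the table.
ranges = {blood: round(hi - lo, 2) for blood, (hi, lo) in protein.items()}
widest = max(ranges, key=ranges.get)
narrowest = min(ranges, key=ranges.get)

for blood, value in ranges.items():
    print(f"{blood}: {value}")
print("widest range:", widest, "| narrowest range:", narrowest)
```

In Excel the same result comes from a formula of the form maximum cell minus minimum cell for each blood type.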
d. Presenting the extent to which protein is declined by age
Average protein at each age level
Age  Average of Total Protein level (g/dL)
17 7.73
18 6.85
19 6.98
20 7.11
21 7.53
22 6.93
23 7.48
24 6.76
25 7.27
26 7.30
27 6.90
28 7.14
29 7.68
30 7.16
31 7.26
32 7.44
33 7.10
34 6.56
35 7.27
36 6.70
37 7.53
38 7.41
39 7.52

40 6.52
41 7.68
42 6.87
43 7.49
44 7.27
45 7.03
46 7.23
47 7.14
48 7.03
49 7.40
50 7.60
51 7.41
52 7.52
53 7.18
54 6.97
55 7.48
56 7.23
57 7.21
Interpretation- The above analysis shows that the protein level fluctuates as age increases and does not necessarily decline.
The table shows that the average protein level both rises and falls from one age to the next, with no steady downward trend.
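One way to quantify the extent of any decline is to fit a straight line to the age averages. The following Python sketch uses the values from the table and an ordinary least-squares slope; fitting a linear trend is an assumption added here for illustration and is not the method used in the original analysis.

```python
# Average total protein (g/dL) for ages 17 to 57, copied from the table.
ages = list(range(17, 58))
protein = [
    7.73, 6.85, 6.98, 7.11, 7.53, 6.93, 7.48, 6.76, 7.27, 7.30,
    6.90, 7.14, 7.68, 7.16, 7.26, 7.44, 7.10, 6.56, 7.27, 6.70,
    7.53, 7.41, 7.52, 6.52, 7.68, 6.87, 7.49, 7.27, 7.03, 7.23,
    7.14, 7.03, 7.40, 7.60, 7.41, 7.52, 7.18, 6.97, 7.48, 7.23, 7.21,
]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(protein) / n

# Ordinary least-squares slope and intercept for protein = intercept + slope * age.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, protein))
sxx = sum((x - mean_x) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"slope = {slope:+.4f} g/dL per year of age")
print(f"lowest average = {min(protein)}, highest average = {max(protein)}")
```

A slope close to zero would be consistent with the interpretation that protein fluctuates with age rather than steadily declining.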

e. Presenting two best visualisation tool for which is highly relevant to the current data set
The two visualisation tools that best suit this data set are graphs and pivot tables, used with the appropriate techniques.
Graphs help in identifying whether a trend is increasing or decreasing. Pivot tables, on the other hand, help in computing the
required summary statistics.
a.
Age
Regression Statistics
Multiple R          0.847818
R Square            0.718795
Adjusted R Square   0.706013
Standard Error      7.198357
Observations        24
ANOVA
             df   SS         MS         F          Significance F
Regression   1    2913.874   2913.874   56.23464   1.7E-07
Residual     22   1139.96    51.81635
Total        23   4053.833
            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   -2.03481       7.931881         -0.25654   0.799921   -18.4845    14.4149     -18.4845      14.4149
Age         0.828609       0.110496         7.498976   1.7E-07    0.599454    1.057765    0.599454      1.057765
Weight
Regression Statistics
Multiple R          0.409874
R Square            0.167997
Adjusted R Square   0.130178
Standard Error      12.38181
Observations        24
ANOVA
             df   SS         MS         F          Significance F
Regression   1    681.03     681.03     4.442198   0.046684
Residual     22   3372.803   153.3092
Total        23   4053.833
              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept     23.31929       15.90552         1.466113   0.156766   -9.66674    56.30533    -9.66674      56.30533
Weight (Kg)   0.419396       0.198988         2.107652   0.046684   0.006722    0.832071    0.006722      0.832071
Gender
Regression Statistics
Multiple R          0.118499
R Square            0.014042
Adjusted R Square   -0.03077
Standard Error      13.47879
Observations        24
ANOVA
             df   SS         MS         F          Significance F
Regression   1    56.92424   56.92424   0.313325   0.581302
Residual     22   3996.909   181.6777
Total        23   4053.833
            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   51.90909       8.509817         6.099907   3.86E-06   34.26081    69.55737    34.26081      69.55737
Gender      3.090909       5.521898         0.559755   0.581302   -8.36081    14.54263    -8.36081      14.54263
b.
Null hypothesis (H0): There is no significant difference in the mean value of risk of diabetes with regards to person’s age, weight and
the gender.
Alternative hypothesis (H1): There is a significant difference in the mean value of risk of diabetes with regards to person’s age,
weight and the gender.
Regression Statistics
Multiple R          0.91003208
R Square            0.82815839
Adjusted R Square   0.79198121
Standard Error      6.0550859
Observations        24
ANOVA
             df   SS           MS           F            Significance F
Regression   4    3357.21609   839.304024   22.8917339   4.8031E-07
Residual     19   696.617239   36.6640652
Total        23   4053.83333
              Coefficients   Standard Error   t Stat        P-value      Lower 95%     Upper 95%    Lower 95.0%   Upper 95.0%
Intercept     -21.9929733    10.2923126       -2.13683495   0.04583353   -43.5350311   -0.4509154   -43.5350311   -0.4509154
Age           0.80400683     0.09513116       8.45156116    7.3444E-08   0.60489502    1.00311864   0.60489502    1.00311864
Weight (Kg)   0.38656242     0.11331465       3.41140721    0.00292792   0.14939213    0.62373271   0.14939213    0.62373271
Gender        -5.23431886    2.89726694       -1.80664018   0.08668324   -11.2983683   0.82973054   -11.2983683   0.82973054
Life style    -0.60213326    1.72693001       -0.34867265   0.73116902   -4.21663931   3.01237278   -4.21663931   3.01237278
Interpretation- From the above analysis, the Significance F of the multiple regression is 4.80E-07, which is lower than 0.05,
so the null hypothesis is rejected: taken together, the independent variables have a significant relationship with the risk of
diabetes. The P-values in the coefficient table show which individual variables are significant. Age (7.34E-08) and weight
(0.0029) both have P-values lower than 0.05, so they are significant predictors of the risk of diabetes, while gender (0.0867)
and lifestyle (0.7312) are not significant at the 5% level. R Square indicates the goodness of fit. In the single-variable
regressions, age shows the highest value, 0.7188 or about 72%, which means that age is the variable that best fits the risk of
diabetes. The values for the other variables are much lower, 16.8% for weight and 1.4% for gender, which shows that these
variables do not fit the risk of diabetes well on their own. The combined model has an R Square of 0.8282, so the four
variables together explain about 83% of the variation in the risk of diabetes.
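The fit statistics quoted in the interpretation can be recomputed from the sums of squares in the ANOVA tables. The following Python sketch uses the standard definitions of R Square, Adjusted R Square and the F statistic; the function name `fit_statistics` is illustrative.

```python
def fit_statistics(ss_regression, ss_residual, n_obs, n_predictors):
    """Recompute R Square, Adjusted R Square and F from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    df_residual = n_obs - n_predictors - 1
    r_square = ss_regression / ss_total
    adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual
    f_stat = (ss_regression / n_predictors) / (ss_residual / df_residual)
    return r_square, adj_r_square, f_stat

# (regression SS, residual SS, number of predictors) from the ANOVA tables.
models = {
    "Age": (2913.874, 1139.96, 1),
    "Weight": (681.03, 3372.803, 1),
    "Gender": (56.92424, 3996.909, 1),
    "All four variables": (3357.21609, 696.617239, 4),
}

# Every model was fitted on 24 observations.
results = {
    name: fit_statistics(ssr, sse, 24, k)
    for name, (ssr, sse, k) in models.items()
}

for name, (r2, adj, f) in results.items():
    print(f"{name}: R2 = {r2:.4f}, adjusted R2 = {adj:.4f}, F = {f:.2f}")
```

Comparing adjusted R Square rather than R Square is the fairer way to set the four-variable model against the single-variable models, because it penalises the extra predictors.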
Strengths of regression analysis- This analysis has multiple benefits, as it states the relationship between the independent
and the dependent variables. In this study the dependent variable is the risk of diabetes, while the independent variables are
age, weight, gender and lifestyle. It shows the influence of several independent variables on the dependent variable. In a
business setting, regression analysis helps in predicting long-term sales and in understanding the required level of inventory.
It is a useful technique for assessing the demand for and supply of a company's products and services. Regression analysis
also helps in reviewing how different variables affect an outcome. The technique is considered most valuable when forecasting
or when testing hypotheses.
8.
Computation of salary after tax
Year  Annual salary  Taxable amount 1  Taxable amount 2  Taxable amount 3  Tax (category 1)  Tax (category 2)  Tax (category 3)  Total tax  Salary after tax
1 80000 50000 30000 0 7500 6000 0 13500 66500
2 82400 50000 30000 2400 7500 6000 600 14100 68300
3 84872 50000 30000 4872 7500 6000 1218 14718 70154

4 87418 50000 30000 7418 7500 6000 1855 15355 72064
5 90041 50000 30000 10041 7500 6000 2510 16010 74031
6 92742 50000 30000 12742 7500 6000 3185 16685 76056
7 95524 50000 30000 15524 7500 6000 3881 17381 78143
8 98390 50000 30000 18390 7500 6000 4597 18097 80292
9 101342 50000 30000 21342 7500 6000 5335 18835 82506
10 104382 50000 30000 24382 7500 6000 6095 19595 84786
11 107513 50000 30000 27513 7500 6000 6878 20378 87135
12 110739 50000 30000 30739 7500 6000 7685 21185 89554
13 114061 50000 30000 34061 7500 6000 8515 22015 92046
14 117483 50000 30000 37483 7500 6000 9371 22871 94612
15 121007 50000 30000 41007 7500 6000 10252 23752 97255
16 124637 50000 30000 44637 7500 6000 11159 24659 99978
17 128377 50000 30000 48377 7500 6000 12094 25594 102782
18 132228 50000 30000 52228 7500 6000 13057 26557 105671
19 136195 50000 30000 56195 7500 6000 14049 27549 108646
20 140280 50000 30000 60280 7500 6000 15070 28570 111710
21 144489 50000 30000 64489 7500 6000 16122 29622 114867
22 148824 50000 30000 68824 7500 6000 17206 30706 118118
23 153288 50000 30000 73288 7500 6000 18322 31822 121466
24 157887 50000 30000 77887 7500 6000 19472 32972 124915
25 162624 50000 30000 82624 7500 6000 20656 34156 128468
26 167502 50000 30000 87502 7500 6000 21876 35376 132127
27 172527 50000 30000 92527 7500 6000 23132 36632 135895
28 177703 50000 30000 97703 7500 6000 24426 37926 139777
29 183034 50000 30000 103034 7500 6000 25759 39259 143776
30 188525 50000 30000 108525 7500 6000 27131 40631 147894
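The table can be regenerated with a short Python sketch. The text does not state the pay rise or the tax schedule, so the figures below are inferred from the table itself and should be read as assumptions: a 3% annual raise from $80,000, with tax of 15% on the first $50,000, 20% on the next $30,000 and 25% on any amount above $80,000.

```python
# Assumptions inferred from the table, not stated in the text: salary grows 3% a
# year, the first $50,000 is taxed at 15%, the next $30,000 at 20%, the rest at 25%.
BRACKETS = [(50_000, 0.15), (30_000, 0.20), (float("inf"), 0.25)]

def salary(year, start=80_000, growth=0.03):
    """Gross annual salary in the given year (year 1 is the starting salary)."""
    return start * (1 + growth) ** (year - 1)

def total_tax(gross):
    """Apply each bracket in turn to the part of the salary that falls in it."""
    tax, remaining = 0.0, gross
    for width, rate in BRACKETS:
        taxable = min(remaining, width)
        tax += taxable * rate
        remaining -= taxable
    return tax

def salary_after_tax(year):
    gross = salary(year)
    return gross - total_tax(gross)

for year in (1, 2, 3, 30):
    gross = salary(year)
    print(year, round(gross), round(total_tax(gross)), round(salary_after_tax(year)))
```

Small differences of a dollar or so against the table are possible because the table rounds the salary to whole dollars.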

a. Assessment of annual investments and returns
Year  Salary after tax  10% investment from salary after tax  Return on investment 5%  Investment sum + return
1 66500 6650 332.5 6982.5
2 68300 6830 341.5 7171.5
3 70154 7015.4 350.77 7366.17
4 72064 7206.36 360.318 7566.68
5 74031 7403.05 370.153 7773.21
6 76056 7605.64 380.282 7985.93
7 78143 7814.31 390.716 8205.03
8 80292 8029.24 401.462 8430.71
9 82506 8250.62 412.531 8663.15
10 84786 8478.64 423.932 8902.57
11 87135 8713.5 435.675 9149.17
12 89554 8955.4 447.77 9403.17
13 92046 9204.57 460.228 9664.79
14 94612 9461.2 473.06 9934.26
15 97255 9725.54 486.277 10211.8
16 99978 9997.8 499.89 10497.7
17 102782 10278.2 513.912 10792.2
18 105671 10567.1 528.354 11095.4
19 108646 10864.6 543.23 11407.8
20 111710 11171 558.552 11729.6
21 114867 11486.7 574.333 12061
22 118118 11811.8 590.588 12402.4
23 121466 12146.6 607.331 12754

24 124915 12491.5 624.576 13116.1
25 128468 12846.8 642.338 13489.1
26 132127 13212.7 660.633 13873.3
27 135895 13589.5 679.477 14269
28 139777 13977.7 698.887 14676.6
29 143776 14377.6 718.878 15096.4
30 147894 14789.4 739.47 15528.9
Total 320200
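The investment table and its total can be reproduced with the following Python sketch. As in the table, each year's investment earns a single 5% return and returns are not compounded across years. The salary growth and tax rates are inferred from the salary table rather than stated in the text, so they are assumptions.

```python
def salary_after_tax(year):
    # Inferred assumptions: 3% raises from $80,000, and tax of 15% on the first
    # $50,000, 20% on the next $30,000 and 25% on anything above $80,000.
    gross = 80_000 * 1.03 ** (year - 1)
    tax = 50_000 * 0.15 + 30_000 * 0.20 + max(gross - 80_000, 0) * 0.25
    return gross - tax

INVEST_SHARE = 0.10   # share of after-tax salary invested each year
RETURN_RATE = 0.05    # one-year return applied to each year's investment

rows = []
for year in range(1, 31):
    invested = salary_after_tax(year) * INVEST_SHARE
    gain = invested * RETURN_RATE
    rows.append((year, invested, gain, invested + gain))

total = sum(row[3] for row in rows)
print(f"year 1: invested {rows[0][1]:.2f}, return {rows[0][2]:.2f}")
print(f"total of investment sum + return over 30 years = {total:,.0f}")
```

If returns were reinvested and compounded until year 30, the final balance would be higher than this total, so the figure reflects the simplified approach used in the table.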
b. If Matthew aims to gain $1,500,000 at the end of the 30th year, what percentage of his salary he should put in the investment
annually.
Calculation of present value:
Future value = $1,500,000
Number of periods = 30 years
Interest rate = 5%
Present value = Future value / (1 + interest rate) ^ n
= $1,500,000 / (1.05) ^ 30
= $347,066.17
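The present-value figure can be checked by discounting the target amount over 30 years at 5%. This is a minimal Python sketch of that single calculation; it does not go on to derive the percentage of salary, which the text leaves unanswered.

```python
future_value = 1_500_000
rate = 0.05
periods = 30

# Discount the future amount back to today: PV = FV / (1 + r) ** n.
present_value = future_value / (1 + rate) ** periods

print(f"present value = ${present_value:,.2f}")
```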
9. Linear programming assessment
a. Writing linear optimization model for company in order to make best decision
LPP MODEL