ETS2111: Business Data Modelling Assignment - Semester 2

Verified

Added on  2022/09/29

|11
|1543
|26
Homework Assignment
AI Summary
This assignment solution for ETS2111, a business data modelling course, addresses three key questions involving statistical analysis. Question 1 utilizes the Chi-square test to determine the independence of gender and purchasing behavior, presenting a contingency table, calculating expected frequencies, and interpreting the test statistic to fail to reject the null hypothesis. Question 2 employs ANOVA to compare the mean ages of workers across four locations, providing a one-way ANOVA summary table and conducting t-tests to identify significant differences between plant pairs. Question 3 explores regression analysis, examining correlations between variables like price, age, and attendance, constructing scatterplots, and comparing two regression models (Model 1 and Model 2) to determine the best explanatory variables for price variation, interpreting coefficients, and discussing the impact of control variables. The solution includes detailed explanations, Excel outputs, and statistical inferences to support the findings.
tabler-icon-diamond-filled.svg

Contribute Materials

Your contribution can guide someone’s learning journey. Share your documents today.
Document Page
Running head: ESTS2111 1
tabler-icon-diamond-filled.svg

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
ESTS2111 2
ETS2111 Business Data Modelling Assignment
Question 1
a) Gender is a categorical variable whose measurement scale (Male and Female) is the
ordinal scale. Similarly, ‘Buy Category' is a categorical variable with ordinal
measurement scale (High, Medium and Low)
b) Table 1 provides frequency counts for cells created based on categories of variables
‘Gender' and ‘Buy Category'
Buy-Category
Gender High Low Medium
Grand
Total
Female 69 79 56 204
Male 39 61 46 146
Grand Total 108 140 102 350
Table 1: A contingency Table of Gender vs. Buy-Category counts
c) To test whether purchasing behavior is related to the gender of the customer, we use
the Chi-Square test for independence or association. Assuming that the test statistic is
Q,
Q Xk
2
Where k is a positiveinteger indicating the degrees of freedom .
For a Chi-Square test of association, the degrees of freedom (DF) is given by:
DF (k)= ( r1 )( c1)
Where cr arethe levels of each categorical variable .
Therefore, From Table 1,
k = ( 21 )( 31 ) =2
Thus
Q X2
2
Document Page
ESTS2111 3
d) To test whether purchasing behavior is related to the gender of the customer, the first
step involves stating the null and alternative hypothesis:
H0 :GenderPurchasing Behaviour are independent
Ha :GenderPurchasing behavior are not independent
Next, the Chi-Square test statistic is determined.
χ2= ( Observed ValueExpected Value )2
Expected Value
Each expected value is determined by multiplying each row total by each column total
and dividing by the overall total. Therefore, using Table 1, the following table of
expected frequencies is generated:
Expected Frequencies
Buy-Category
Gender High Low Medium Total
Female 62.9486 81.6000 59.45142857 204
Male 45.0514 58.4000 42.54857143 146
Total 108 140 102 350
Table 2: Table of Expected Frequencies
Using Table 2 Values and Chi-Square Formula, a table of Chi-Square values is generated:
Chi-Square Values
0.5817 0.0828 0.2004
0.8128 0.1158 0.2800
Table 3: Chi-Square values (Test statistic)
Therefore, Chi-Square test statistic,
¿ 0.5817+ 0.0828+0.2004+0.8128+ 0.1158+0.28=2.0735
Using CHIINV function in excel at DF=2 and 1% level of significance, the critical value is
9.2103. Given that the test statistic is less than the critical value, we fail to reject the null
Document Page
ESTS2111 4
hypothesis and conclude that there is significant statistical evidence to indicate that gender
and purchasing behavior is independent.
Question 2
a) ANOVA test could be used to compare the mean ages of workers at the four
locations. The types of variations that could be used include unidirectional (one-way)
and multidirectional (two-way) variations. In this case, one-way ANOVA is
appropriate because only one independent variable being considered. An ANOVA
test, it is assumed that the test statistic (response variable residuals) follows a normal
distribution. In a one-way ANOVA test, the test statistic depends on a set of three
degrees of freedom: Between the Groups DF, Within the Groups DF and Total DF.
BetweenGroups DF=k1 ; where k isthe total number of groups . Therefore ,
BetweenGroups DF=41=3
WithinGroups DF=N k ; where N is the total number of observations. Thus ,
WithinGroups DF=244=20
Total DF =N1=241=23
b) The null and alternative hypotheses in this case are:
H0 : μA =μB =μC=μ
Ha : μi μm
Where μiμm arethe means of any of thetwo samples ,μA μD are themeans of
sample A ¿ D
Table 4 below summarizes the output of One-Way ANOVA test in Excel:
One-Way ANOVA
SUMMARY
Groups Count
Su
m
Averag
e
Varianc
e
tabler-icon-diamond-filled.svg

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
ESTS2111 5
A 7 201 28.71
2.23809
5
B 6 195 32.50 3.5
C 5 124 24.80 0.7
D 6 161 26.83
10.9666
7
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups
181.063
1 3 60.35
13.6298
7
4.53E-
05
3.09839
1
Within Groups 88.5619 20 4.43
Total 269.625 23
Table 4: One-way ANOVA Test Results (Excel Output)
As shown in Table 4, the p-value of the test is 0.000045. Accordingly, at a 5% level
of significance, given that the p-value is less than 0.05, we reject the null hypothesis
and conclude that there is significant statistical evidence to show that there is a
significant difference in the mean ages of the workers.
c) To determine the pairs of plants that show a significant difference at 5% level of
significance, we conduct T-tests for each pair. Given that the plants are A, B, C and
D, the pairs are A and B, A and C, A and D, B and C, B and D, and C and D. With the
hypotheses:
H0 : μ1=μ2
H1 : μ1 μ2
Summary of P-Values
A B C D
A 0.001293 0.000179438 0.240155
B 4.10402E-05 0.0065
C 0.197374
Table 5: A summary of p-values from T-tests
As shown in Table 5, for A and D, and C and D, the p-values are greater than 0.05 hence, at a
5% level of significance, we fail to reject the null hypothesis and conclude that the means are
Document Page
ESTS2111 6
equal. Therefore, the pairs with of plants showing significant differences in mean age are A
and B, A and C, B and C, and B and D.
Question 3
a) Scatterplots and correlation coefficients:
Price
(£) Age (Years)
Attendance
Size
Previous
Attendance
Price (£) 1
Age (Years)
0.91965
5 1
Attendance Size
0.89033
5 0.80337544 1
Previous
Attendance
0.38247
3
0.32647854
6 0.205883491 1
Table 6: Correlation coefficients between variables
As shown in Table 6, (Price, Age), (Price, Size), and (Age, Size) have a strong and positive
correlation. These relationships seem reasonable. Given that auctioneer is selling antique and
semi-antique rugs, the older the rug the higher the price hence justifying the Price-Age
relationship. Similarly, high attendance implies higher competition hence higher prices.
Lastly, in the antiquate and semi-antiquate products, the more the age the higher the demand
hence validating the relationship suggested by the correlation coefficient. Contrarily, Price
and previous attendance have a weak but positive correlation. This relationship is also
reasonable returning customers usually demonstrate previous satisfaction hence likely to bid
higher. However, because of other factors such as the number of people in the auction that
influence the price, the relationship is weak. Besides, returning customers may be given
discounts hence lower prices.
Document Page
ESTS2111 7
5000 10000 15000 20000 25000 30000 35000
0
20
40
60
80
100
120
140
160
180
200
Scatter plot Price vs. Age
Price in British Pounds
Age
Figure 1: Scatter plot of Price Vs. Age
5000 10000 15000 20000 25000 30000 35000
0
20
40
60
80
100
120
140
Scatter Plot: price vs. Size
Price
Size
Figure 2: A Scatter plot of price against Attendance Size
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
ESTS2111 8
5000 10000 15000 20000 25000 30000 35000
0
2
4
6
8
10
12
14
Scatterplot: Price vs. Previous Attendance
Price
Previous Attendance
Figure 3: A Scatter of Price against Previous Attendance
40 60 80 100 120 140 160 180 200
0
20
40
60
80
100
120
140
Scatter plot: Size vs. Age
Age
Size
Figure 4: Scatter plot of Size against age
b) The best explanatory variable to explain the variations in the ‘Price’ because of
having the highest correlation coefficient and meeting the linearity test as
demonstrated in Figure 1.
c) As indicated by the value of R-Square in Table 7, 92.33% of the variation in ‘Price’ is
explained by the explanatory variables (Age, Size and Previous attendance).
Document Page
ESTS2111 9
Regression Output Summary
Statistic P-value
R Square
0.92328
3
Intercept
2395.32
6
0.03860
3
Age (Years) 73.7991
4.59E-
05
Attendance Size
101.812
1
0.00018
7
Previous
Attendance
173.092
2
0.06864
6
Table 7: Regression Output Summary
d) Price=β0 +β1 Age+ β2 ¿ β ¿3 Prev . Attendance+ ε
Based on the information in Table 7,
Price=2395.33+73.8 Age+101.81 ¿ rev . Attendance
The coefficient of the explanatory variable ‘Age' is 73.8. This implies that if the age
of the antiquate and semi-antiquate rug increase by one year, the price of the rug
increases by 73.8 pounds.
e) To identify the reasonableness of the variables, we test whether the coefficients are
significant. Therefore,
H0 : βi=0
Ha : βi 0
At a 1% level of significance, the auctioneer cannot conclude that all explanatory
variables in Model 1 are reasonable. As shown in Table 7, the p-value of ‘previous
attendance' is 0.07. Given that 0.07 is greater than 0.01, we fail to reject the null
hypothesis and conclude that ‘previous attendance’ is insignificant. Contrarily, the p-
value of Age and Size is less than 0.01 hence the coefficients are significant.
Accordingly, it is reasonable to only include Age and Size.
Document Page
ESTS2111 10
f) Model 2 is performing better than Model 1. This is demonstrated by higher values of
S squared and Adjusted R Squared, lower standard error. In terms of significance,
both models are significance as intimated by F significance.
Model # R2 Adj-R2
Standar
d Error
F-
statistic t-ratios
Model 2
0.93382
9
0.92122
5 1373.91
74.0896
6
4.4736E-
12
Model 1
0.92328
3
0.91282
1
1445.33
6
88.2556
8
2.02076E
-12
Table 8: A comparison of Model 1 and Model 2’s key statistics
g) ‘Age*Size' is a control variable and it is added to facilitate removal of their effect on
the model if it is suspected that they have an influence. In this case, it is not
reasonable because it is statistically meaningless hence likely to falsely improve the
model.
h) This anomaly results from the fact that price and Age*Size are automatically
correlated hence improving the proportion of variation in price explained by the
explanatory variables. However, because this control variable does not meet the
assumption of linear regression, it does not contribute to the linear relationships.
i) Age=90
¿ 110
Prev . Attendance=8
Age¿ 9900
Using the information in Table 9,
^Price=2840.78+124.1990+175.46110+199.6280.68880=£ 22510
tabler-icon-diamond-filled.svg

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
ESTS2111 11
Coefficients
Intercept
-
2840.78241
5
Age (Years)
124.191193
9
Attendance Size
175.464153
6
Previous
Attendance
199.618991
9
Age*Size
-
0.67922556
1
Table 9: Model 2 Coefficients
Confidence Level
Prediction
Intervals for
Price
0.95 Predicted Y Standard
Error Lower 95% Upper
95.0%
Prediction
Interval
22510.1008
2 1373.910156 19794.77945
25225.4221
8
Table 10: Prediction Interval calculation
PI =(19794.78 , 25225.42)
chevron_up_icon
1 out of 11
circle_padding
hide_on_mobile
zoom_out_icon
logo.png

Your All-in-One AI-Powered Toolkit for Academic Success.

Available 24*7 on WhatsApp / Email

[object Object]