Two Sample and Independent Sample t-test


Added on  2021/04/21

AI Summary
The assignment involves conducting two-sample and independent-sample t-tests to determine if there is an extra premium for brick houses in neighborhood 3, compared to traditional neighborhoods (1 & 2) and newer neighborhoods (3). The analysis includes hypothesis testing, output interpretation, and a summary of the findings. Annotated bibliography references are also provided for additional reading.

Running head: BUSINESS ANALYTICS AND DECISION MODELLING
Business Analytics and Decision Modelling
Name of the Student:
Name of the University:
Author’s Note:

Table of Contents
1. Predicting Software Reselling Profits
1.1. Exploratory Statistics:
1.2. Scatter plot:
1.2.1. Freq vs. Spending scatter plot:
1.2.2. Last Update vs. Spending scatter plot:
1.3. Prediction of Spending:
1.3. A. Randomization of the samples and Preprocessing of categorical variables:
1.3. B. Linear Regression Model:
1.3. C.
1.3. D.
1.3. E.
1.3. F.
1.3. G.
1.3. H.
1.3. I.
1.3. J.
2. Housing Price Structure in “NOTAREAL” Township:
2.1. One sample t-test:
2.2. One sample t-test:
2.3. Two Sample and Independent Sample t-test:
2.4. Transformation of level of Neighbourhood:
Annotated Bibliography:
1. Predicting Software Reselling Profits
1.1. Exploratory Statistics:
US address vs. Spending:
Out of 1000 sampled customers, 167 have a US address; their average spending is $213 with a
standard deviation of $201. The remaining 833 customers, who do not have a US address, have
an average spending of $204 with a standard deviation of $225.
Web Order vs. Spending:
Out of 1000 sampled customers, 456 did not place any order via the web; their average
spending is $208 with a standard deviation of $223. The remaining 544 customers, who placed
at least one order via the web, have an average spending of $202 with a standard deviation of $219.
Gender vs. Spending:

Among 1000 customers, 486 female customers spend an average amount of $210 with
standard deviation $223. 514 male customers spend an average amount of $201 with standard
deviation $219.
Address_Res vs. Spending:
Among 1000 customers, the 777 customers whose address is not a residence spend an average
amount of $211 with standard deviation $240. The 223 customers whose address is a residence
spend an average amount of $185 with standard deviation $133.
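The group summaries above (count, mean and standard deviation of Spending per category) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `summarize_by_group` and the five-row sample are hypothetical, since the report's 1000-row dataset is not reproduced here.

```python
from statistics import mean, stdev


def summarize_by_group(rows, group_key, value_key="Spending"):
    """Return {group: (count, mean, sample standard deviation)} of value_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {
        group: (len(values), mean(values), stdev(values) if len(values) > 1 else 0.0)
        for group, values in groups.items()
    }


# Made-up rows for illustration only (not the report's data).
customers = [
    {"US_Address": "Yes", "Spending": 100.0},
    {"US_Address": "Yes", "Spending": 300.0},
    {"US_Address": "No", "Spending": 50.0},
    {"US_Address": "No", "Spending": 150.0},
    {"US_Address": "No", "Spending": 250.0},
]
summary = summarize_by_group(customers, "US_Address")
for group, (count, avg, sd) in summary.items():
    print(f"{group}: n={count}, mean={avg:.1f}, sd={sd:.1f}")
```

The same function could be called with "Web_Order", "Sex" or "Address_Res" as the grouping key, assuming the data uses those column names.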
1.2. Scatter plot:
1.2.1. Freq vs. Spending scatter plot:
The scatter plot takes “Number of transactions in last year at source catalogue” as the
independent variable and “Spending” as the dependent variable. The fitted trend line suggests
that a linear regression fits the data moderately well, so a moderately strong linear
association is present.
1.2.2. Last Update vs. Spending scatter plot:
The scatter plot takes “How many days ago was last update to customer record” as the
independent variable and “Spending” as the dependent variable. The fitted trend line fits the
data poorly. Hence, no clear linear relationship is evident.
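The strength of the linear association seen in a scatter plot can be quantified with the Pearson correlation coefficient. The sketch below is an illustration only: the function name `pearson_r` and the sample values are assumptions, and the report itself does not state any correlation values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up values: a perfectly linear pair and a perfectly inverse pair.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
```

A value near +1 or -1 corresponds to a tight linear trend, while a value near 0 corresponds to a plot like the one described for Last Update vs. Spending.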
1.3. Prediction of Spending:
1.3. A. Randomization of the samples and Preprocessing of categorical variables:

The number of “Training” samples is 700, whereas the number of “Validation” samples is 300.
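A random 700/300 partition of this kind can be sketched as below. The function name `split_train_validation` and the fixed seed are assumptions for illustration; the report does not say which tool or seed produced its split.

```python
import random


def split_train_validation(rows, n_train, seed=0):
    """Shuffle a copy of rows and split it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[:n_train], shuffled[n_train:]


# Row indices 0..999 stand in for the 1000 customer records.
train, validation = split_train_validation(range(1000), n_train=700)
```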
1.3. B. Linear Regression Model:
1.3. C.
The value of R² (coefficient of determination) is 0.449. Hence, the independent
variables Frequency, Last_Update, US_Address, Web_Order, Sex and Address_Res
explain only 44.9% of the variability of the dependent variable, “Spending”.
1.3. D.
According to the ANOVA table, the p-value of the F-statistic is reported as 0.0, that is,
zero to the displayed precision.
The whole model is significant, as this p-value is below 0.05.
Null Hypothesis (H0): There is no significant linear association between the dependent variable
and the independent variables in the linear regression model.
Alternative Hypothesis (HA): There is a significant linear association between the dependent
variable and the independent variables in the linear regression model.
As the calculated p-value is less than the 5% level of significance, we reject the null
hypothesis of no significant association between the dependent and independent variables
at the 5% level. The alternative hypothesis is accepted.
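For a model with k predictors fitted on n samples, the overall F-statistic can be recovered from R² as F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below applies this identity under the assumption that the R² of 0.449 from 1.3.C belongs to the six-predictor model fitted on the 700 training samples; the report does not quote the F value itself, so the result is illustrative rather than a reproduction of its output.

```python
def overall_f_statistic(r_squared, n_samples, n_predictors):
    """Overall regression F-statistic and its degrees of freedom from R^2."""
    df_model = n_predictors
    df_resid = n_samples - n_predictors - 1
    f_value = (r_squared / df_model) / ((1 - r_squared) / df_resid)
    return f_value, df_model, df_resid


# Assumed inputs: R^2 = 0.449, n = 700 training samples, k = 6 predictors.
f_value, df_model, df_resid = overall_f_statistic(0.449, 700, 6)
print(f"F({df_model}, {df_resid}) = {f_value:.2f}")
```

A large F value on these degrees of freedom corresponds to a very small p-value, which is in line with the significant overall model reported above.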
1.3. E.
In the linear regression model, not all the factors are significant. Frequency,
Last_Update and Address_Res are found significant (p-value 0.0 < 0.05). US_Address,
Web_Order and Sex are not significant factors, as their p-values are greater than 0.05.
1.3. F.
All VIF values (collinearity statistics) lie between 1 and 2. As a rule of thumb, a VIF
between 1 and 10 indicates no serious multicollinearity. Accordingly, no significant
multicollinearity is observed in any independent factor.
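The variance inflation factor of predictor j is defined as VIF_j = 1 / (1 - R_j²), where R_j² is the R² obtained by regressing predictor j on the other predictors. A minimal sketch of that definition follows; the function name `vif` and the example R_j² values are illustrative and are not taken from the report's output.

```python
def vif(r_squared_j):
    """Variance inflation factor from the R^2 of one predictor on the others."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("r_squared_j must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


# Illustrative values: no shared variance, half shared, and 90% shared.
examples = {r2: vif(r2) for r2 in (0.0, 0.5, 0.9)}
```

Under this definition, a VIF between 1 and 2 means that each predictor shares at most half of its variance with the remaining predictors.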
1.3. G.
On the basis of this model, the customers most likely to spend a large amount of money
are female customers outside the US who do not place orders via the web, do not have a
residential address, made a higher number of transactions in the last year and had their
customer record updated more recently.
1.3. H.

The prediction values for the “validation” set of 300 samples are computed by the
linear regression model:
Spending = -56.688 + 77.797*Frequency - 0.021*Last_Update - 4.63*US_Address
- 5.335*Web_Order - 11.192*Sex + 80.455*Address_Res.
According to the model, the prediction values are computed by coding the categorical
variables as “Yes” = 1 and “No” = 2 in the corresponding numeric variables.
For the first purchase, Spending = $(-56.688 + 77.797*1 - 0.021*3215 - 4.63*1 - 5.335*2
- 11.92*1 + 80.455*2) = $87.284.
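The arithmetic for the first purchase can be checked with a short script. The coefficients and inputs below are copied from the worked line above. The Sex coefficient is taken as 11.92 as in that line, although the model equation lists 11.192, so this is an assumption about which figure was intended. The dictionary keys and the `predict` helper are hypothetical names.

```python
# Coefficients as they appear in the worked example (Sex uses 11.92, an assumption).
coefficients = {
    "Intercept": -56.688,
    "Frequency": 77.797,
    "Last_Update": -0.021,
    "US_Address": -4.63,
    "Web_Order": -5.335,
    "Sex": -11.92,
    "Address_Res": 80.455,
}

# First purchase, with categorical values coded as in the worked example.
first_purchase = {
    "Frequency": 1,
    "Last_Update": 3215,
    "US_Address": 1,
    "Web_Order": 2,
    "Sex": 1,
    "Address_Res": 2,
}


def predict(coef, row):
    """Linear prediction: intercept plus the sum of coefficient * value."""
    return coef["Intercept"] + sum(coef[name] * value for name, value in row.items())


predicted_spending = predict(coefficients, first_purchase)
print(round(predicted_spending, 3))  # 87.284
```

With the Sex coefficient set to 11.192 instead, the same inputs would give $88.012.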
After finding the predicted values of the dependent variable, we use them in place of
the actual values and compute the mean of the predicted values. The sum of the signed
deviations about the mean (negative values kept as negative) of the predicted values is
treated as the prediction error.
1.3. I.
The value of R-square is 0.383 (38.3%). Hence, the predictive accuracy of the
regression model on the validation set (300 samples) is not very high.
After predicting all the spending values, the average of the absolute differences
(negative values are taken as positive) between the predicted and actual values is
treated as the mean absolute difference (MAD). The square root of the mean of the squared
differences between each predicted and actual value is known as RMSE (root mean square
error). The percentage error of each prediction is calculated in a new column; the mean
absolute percentage error is (100/n) multiplied by the sum of the absolute ratios of the
prediction error to the actual value. R² is [1-(sum of squares of residual values/sum of
squares of total values)] of the regression model. The standard error of the estimate is
the square root of the residual sum of squares divided by the residual degrees of freedom.
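The validation measures described above can be sketched with their standard textbook definitions. The function names and the three-row sample are assumptions for illustration; they are not the report's validation data.

```python
from math import sqrt


def mad(actual, predicted):
    """Mean absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    return 100.0 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
metrics = {
    "MAD": mad(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```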
The regression equation obtained from the “training” dataset (700 samples) is:
Spending = -121.221 + 86.244*Frequency - 0.011*Last_Update + 21.336*US_Address
+ 5.433*Web_order - 4.334*Gender + 78.483*Address_Res
We apply the regression model of the “training” dataset to the “validation” dataset.
Now, we calculate the predicted Y values of the “validation” dataset with the linear
regression model.
1.3. J.
The histogram, normal probability plot and residual plot (scatter plot) are drawn for
all the 1000 samples.
The histogram of the residuals shows that the residuals are not exactly normally
distributed.
The normal probability plot likewise indicates that the residuals depart from a
normal distribution.
The scatter of the residual values suggests that the deviation from the normality
assumption has affected the performance of the regression model.
Histogram plot of Residuals
Normal Probability Plot of Residuals

Scatter plot of Residuals vs. Predictive Values
2. Housing Price Structure in “NOTAREAL” Township:
2.1. One sample t-test:
The question under investigation is whether buyers pay an equal premium for a brick
house or not. The sample contains 42 brick houses.
Hypotheses:
Null hypothesis (H0): The average premium price for all brick houses is equal to $150000.
Alternative hypothesis (HA): The average premium price for all brick houses is greater than
$150000.
Tests and Level of significance:
A one sample t-test is applied to the selling prices of brick houses, and the level of
significance is assumed to be 5%.
Outputs:
The t-statistic = -0.539 and the p-value = 0.593.
Interpretation:
As the p-value is greater than the level of significance (0.593 > 0.05), we fail to reject,
at the 5% level, the null hypothesis that people pay an equal premium price for their
brick houses.
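The one sample t-statistic used here is t = (sample mean - hypothesised mean) / (s / sqrt(n)), with n - 1 degrees of freedom. The sketch below shows the calculation with the standard library. The function name `one_sample_t` and the five prices are made up, as the report's house price data is not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, hypothesised_mean):
    """Return the one sample t-statistic and its degrees of freedom."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    t_value = (mean(sample) - hypothesised_mean) / standard_error
    return t_value, n - 1


# Made-up selling prices for illustration only.
prices = [140000, 145000, 150000, 155000, 160000]
t_value, dof = one_sample_t(prices, 150000)
```

A negative t-statistic, as in the output above, means that the sample mean lies below the hypothesised value.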
2.2. One sample t-test:
The question under investigation is whether buyers pay an equal premium for
neighbourhood 3 or not. The sample contains 39 houses in neighbourhood 3.
Hypotheses:
Null hypothesis (H0): The average premium price for all houses in neighbourhood 3 is equal to
$150000.
Alternative hypothesis (HA): The average premium price for all houses in neighbourhood 3 is
greater than $150000.
Tests and Level of significance:
A one sample t-test is applied to the selling prices of houses in neighbourhood 3, and the
level of significance is assumed to be 5%.
Outputs:
The t-statistic = 2.934 and the p-value = 0.006.
Interpretation:
As the p-value is less than the level of significance (0.006 < 0.05), we reject, at the 5%
level, the null hypothesis that people pay a premium price equal to $150000 for their houses
in neighbourhood 3, and accept the alternative hypothesis of a greater average premium
price.
2.3. Two Sample and Independent Sample t-test:
The question is whether there is an extra premium for a brick house in
neighbourhood 3, in addition to the usual premium for a brick house.
There are 42 brick houses in total, of which 16 are in neighbourhood 3.
In the first case, we consider two samples: one is all 42 brick houses and the other is
the 16 brick houses in neighbourhood 3.

In the second case, we consider two samples: one is 26 brick houses whose neighbourhood is
traditional (1 & 2) and the other is 16 brick houses whose neighbourhood is newer (3).
Hypotheses:
Null hypothesis (H0): The average premium price for all brick houses is equal to that of
brick houses in the newer neighbourhood.
Alternative hypothesis (HA): The average premium price for all brick houses is less than
that of brick houses in the newer neighbourhood.
Tests and Level of significance:
Two sample t-test (for equal and unequal variances) is hereby applied for selling prices of
brick houses and the level of significance is assumed to be 5%.
Outputs:
The mean values of the two groups are $147769 and $175200 respectively, with variances
of 719815848 and 265541333 (in squared dollars) respectively. The difference between the
variances is significant. Hence, the two sample t-test for unequal variances is more
appropriate than the two sample t-test for equal variances. In the two sample t-test for
unequal variances, the p-value = 0.000339.
Interpretation:
As the p-value is less than the level of significance (0.000339 < 0.05), we reject, at the
5% level, the null hypothesis that people pay an equal premium price for brick houses in
general and for brick houses in neighbourhood 3.
Therefore, brick houses in the newer neighbourhood command a higher premium price than
brick houses across all three neighbourhoods.
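The unequal-variance (Welch) t-statistic can be computed directly from group summaries as t = (m1 - m2) / sqrt(v1/n1 + v2/n2), with degrees of freedom from the Welch-Satterthwaite approximation. The sketch below plugs in the means, variances and sample sizes (42 and 16) quoted above. The function name is hypothetical, and the p-value is not computed because the standard library has no t-distribution, so this is only a check of the statistic and not a reproduction of the reported output.

```python
from math import sqrt


def welch_t_from_summary(mean1, var1, n1, mean2, var2, n2):
    """Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    a = var1 / n1  # squared standard error contribution of group 1
    b = var2 / n2  # squared standard error contribution of group 2
    t_value = (mean1 - mean2) / sqrt(a + b)
    dof = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_value, dof


# Summary figures quoted in the Outputs paragraph above.
welch_t, welch_dof = welch_t_from_summary(147769, 719815848, 42, 175200, 265541333, 16)
print(f"t = {welch_t:.3f}, df = {welch_dof:.1f}")
```

The negative sign reflects that the first group (all brick houses) has the lower mean, which matches the direction of the alternative hypothesis.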
Hypotheses:
Null hypothesis (H0): The average premium price for brick houses in the traditional
neighbourhoods is equal to that of brick houses in the newer neighbourhood.
Alternative hypothesis (HA): The average premium price for brick houses in the traditional
neighbourhoods is less than that of brick houses in the newer neighbourhood.
Tests and Level of significance:
Independent sample t-test (for equal and unequal variances) is hereby applied for selling
prices of brick houses and the level of significance is assumed to be 5%.
Outputs:
The mean values of the two groups are $130888 and $175200 respectively, with standard
deviations of $15596 and $16295 respectively. In the two sample t-test for both unequal
and equal variances, the p-value = 0.000.
Interpretation:
The p-values are less than 5%. Therefore, we reject the null hypothesis of equality of
average premium prices and accept the alternative hypothesis of a lower premium price in
the traditional neighbourhoods.
Therefore, brick houses in the newer neighbourhood command a higher premium price than
brick houses in the traditional neighbourhoods.
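For the equal-variance version of the independent sample t-test, the statistic uses a pooled variance: sp² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2) and t = (m1 - m2) / sqrt(sp² (1/n1 + 1/n2)). The sketch below uses the group means and sizes (26 and 16) from this section and treats the figures $15596 and $16295 as standard deviations. That is an assumption, based on $16295 being approximately the square root of the variance 265541333 quoted earlier for the same group. The function name is hypothetical.

```python
from math import sqrt


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two sample t-statistic and its degrees of freedom."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / dof
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, dof


# Traditional neighbourhoods (1 & 2) versus the newer neighbourhood (3).
pooled_t, pooled_dof = pooled_t_from_summary(130888, 15596, 26, 175200, 16295, 16)
print(f"t = {pooled_t:.3f}, df = {pooled_dof}")
```

A t-statistic of this size on 40 degrees of freedom is in line with the very small p-value reported above.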
2.4. Transformation of level of Neighbourhood:
Yes. As the pie chart above indicates, for the purposes of estimation and prediction the
traditional neighbourhoods (1 and 2) could be collapsed into a single “Older” neighbourhood.
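Collapsing the levels amounts to recoding the neighbourhood variable. A minimal sketch follows, assuming the variable takes the integer codes 1, 2 and 3; the function name and the label "Newer" for neighbourhood 3 are illustrative choices.

```python
def collapse_neighbourhood(level):
    """Map neighbourhoods 1 and 2 to 'Older' and neighbourhood 3 to 'Newer'."""
    if level in (1, 2):
        return "Older"
    if level == 3:
        return "Newer"
    raise ValueError(f"unexpected neighbourhood level: {level}")


# Made-up sequence of neighbourhood codes for illustration.
recoded = [collapse_neighbourhood(level) for level in [1, 3, 2, 3, 1]]
```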

Annotated Bibliography:
Baird, G.L. and Bieber, S.L., 2016. The Goldilocks dilemma: Impacts of multicollinearity--a
comparison of simple linear regression, multiple regression, and ordered variable regression
models. Journal of Modern Applied Statistical Methods, 15(1), p.18.
Cronk, B.C., 2017. How to use SPSS®: A step-by-step guide to analysis and interpretation.
Routledge.
Menard, S., 2018. Applied logistic regression analysis (Vol. 106). SAGE publications.
Yockey, R.D., 2016. SPSS demystified: a simple guide and reference. Routledge.