Applied Statistics: ANOVA and Multiple Regression Analysis

Added on 2023/05/31

AI Summary

This article discusses the use of ANOVA and multiple regression analysis in applied statistics. It includes examples of hypothesis testing and model building using R programming.

RUNNING HEADER: APPLIED STATISTICS 1
Applied Statistics
Student's name:
Student's ID:
Stat270/680

Question 1
a.
The hypothesis tests that can be used to determine whether the depths of archeological discoveries
vary by site include one-way ANOVA, linear regression with Site as a categorical predictor, and
pairwise t-tests between sites.
b.
The hypothesis test chosen is one-way ANOVA, because the data meet most of its assumptions. The
first two assumptions are met because the dependent variable (Depth) is continuous and the
independent variable (Site) is categorical.
> boxplot(Depth ~ Site, data = excavate)
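From the boxplot above, the spread of depths is similar across the sites, so the assumption of
constant variance appears reasonable. Independence of observations cannot be read from a boxplot;
it depends on how the data were collected and is assumed here.
The residuals also appear approximately normally distributed, as seen in the diagnostic plots
generated below (depth.1 is the model fitted in part c):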
> plot(depth.1, which = 1:2)

The only assumption violated was the absence of outliers: the diagnostic plots flag observations
36, 28 and 33 as outliers.
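The visual checks described above can be complemented with formal tests. The following is a minimal sketch, assuming a simulated stand-in for the excavate data (the site labels, group sizes and depths are made up; only the total of 47 observations in four groups is taken from the ANOVA degrees of freedom). Bartlett's test and the Shapiro-Wilk test are one possible choice, not necessarily the checks the assignment intended.

```r
# Simulated stand-in for the excavate data; all values are hypothetical.
set.seed(1)
group_sizes <- c(12, 12, 12, 11)
excavate_sim <- data.frame(
  Site  = factor(rep(c("A", "B", "C", "D"), times = group_sizes)),
  Depth = rnorm(47, mean = rep(c(90, 100, 105, 110), times = group_sizes), sd = 16)
)

fit <- lm(Depth ~ Site, data = excavate_sim)

# Homogeneity of variance across sites (H0: the variances are equal).
var_check <- bartlett.test(Depth ~ Site, data = excavate_sim)

# Normality of the residuals (H0: the residuals are normally distributed).
norm_check <- shapiro.test(residuals(fit))

# Observations with large standardised residuals are candidate outliers.
outlier_rows <- which(abs(rstandard(fit)) > 2)

print(var_check$p.value)
print(norm_check$p.value)
print(outlier_rows)
```

Large p-values from the two tests would be consistent with the equal-variance and normality assumptions; the cut-off of 2 for standardised residuals is a common rule of thumb rather than a fixed standard.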
c.
H0: The depths of the archeological discoveries do not vary by site
H1: The depths of the archeological discoveries vary by site
> View(excavate)
> depth.1 = lm(Depth ~ Site, data = excavate)
> anova(depth.1)
Analysis of Variance Table
Response: Depth
          Df  Sum Sq Mean Sq F value  Pr(>F)
Site       3  2697.5  899.16  3.3514 0.02752 *
Residuals 43 11536.5  268.29
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The test statistic computed in R is F = 3.3514. Under the null hypothesis it follows an
F-distribution with 3 and 43 degrees of freedom. The p-value of the one-way ANOVA is 0.02752.
Since the p-value is less than 0.05, we reject the null hypothesis at the 5% level. Thus, it can
be concluded that the depths of the archeological discoveries vary by site, that is, the mean
depth differs between at least two of the sites.
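As a check on the arithmetic, the F statistic and its p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. This is a minimal sketch using only those reported values; the variable names are illustrative.

```r
# Values copied from the ANOVA table above.
ss_site  <- 2697.5;  df_site  <- 3
ss_resid <- 11536.5; df_resid <- 43

ms_site  <- ss_site / df_site      # mean square for Site
ms_resid <- ss_resid / df_resid    # residual mean square
f_stat   <- ms_site / ms_resid     # F statistic

# Upper-tail probability of the F(3, 43) distribution.
p_value <- pf(f_stat, df_site, df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

Small differences from the printed table in the last digits are expected, because the table values are rounded.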
Question 2
a.
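The statistical model to be used is:
batt_avg = β0 + β1 runs + β2 doubles + β3 triples + β4 home_runs + β5 strike + ε
where β0 is the intercept, β1 to β5 are the slope coefficients, and ε is a random error term
assumed to be normally distributed with mean zero and constant variance.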
b.
> plot(baseball)

Batt_avg correlates highly with runs but less with doubles and triples. Strike correlates
negatively with batt_avg and runs. Doubles correlates only weakly with triples. Low correlation
among the predictors limits multicollinearity, which makes them suitable for a multiple
regression.
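The scatterplot matrix can be backed up with a numerical correlation matrix. The following is a minimal sketch on simulated data: the data frame baseball_sim and every value in it are made up for illustration, with only the variable names and the sample size of 45 taken from the output in this question.

```r
# Simulated stand-in for the baseball data; every value below is hypothetical.
set.seed(2)
n <- 45
baseball_sim <- data.frame(
  runs      = runif(n, 0.05, 0.25),
  doubles   = runif(n, 0.02, 0.08),
  triples   = runif(n, 0.00, 0.03),
  home_runs = runif(n, 0.00, 0.10),
  strike    = runif(n, 0.05, 0.30)
)
baseball_sim$batt_avg <- 0.18 + 0.45 * baseball_sim$runs + 1.0 * baseball_sim$doubles -
  0.28 * baseball_sim$strike + rnorm(n, sd = 0.018)

# Pairwise Pearson correlations between all variables.
cor_mat <- cor(baseball_sim)
print(round(cor_mat, 2))

# Largest absolute correlation between two different predictors.
predictors <- c("runs", "doubles", "triples", "home_runs", "strike")
pred_cor   <- cor_mat[predictors, predictors]
max_abs    <- max(abs(pred_cor[upper.tri(pred_cor)]))
print(max_abs)
```

With the real data loaded, cor(baseball) would give the corresponding matrix; a large value of max_abs would be a warning sign for multicollinearity.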
c.
> baseball.lm = lm(batt_avg ~ runs + doubles + triples + home_runs + strike, data = baseball)
> summary(baseball.lm)
Call:
lm(formula = batt_avg ~ runs + doubles + triples + home_runs +
strike, data = baseball)
Residuals:
Min 1Q Median 3Q Max
-0.03970 -0.01143 -0.00101 0.01044 0.03444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18316 0.01714 10.685 3.79e-13 ***
runs 0.44668 0.10963 4.074 0.000219 ***
doubles 0.99090 0.31309 3.165 0.003005 **
triples 0.62160 0.58070 1.070 0.291004
home_runs 0.27374 0.16935 1.616 0.114060
strike -0.28456 0.05177 -5.497 2.59e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01748 on 39 degrees of freedom
Multiple R-squared: 0.8601, Adjusted R-squared: 0.8422
F-statistic: 47.96 on 5 and 39 DF, p-value: 1.261e-15
It is evident that triples (p = 0.291) and home_runs (p = 0.114) are not statistically significant
at the 5% level.
d.
H0: The coefficients of triples and home_runs are both zero (the reduced model is adequate)
H1: At least one of the coefficients of triples and home_runs is not zero
> baseball.2 = lm(batt_avg ~ runs + doubles + strike, data = baseball)
> baseball.2 = update(baseball.lm, . ~ . - triples - home_runs)
> summary(baseball.2)
Call:
lm(formula = batt_avg ~ runs + doubles + strike, data = baseball)
Residuals:
Min 1Q Median 3Q Max
-0.034709 -0.011472 -0.001311 0.011062 0.034968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16941 0.01514 11.188 4.92e-14 ***
runs 0.57494 0.07861 7.314 5.96e-09 ***
doubles 1.11908 0.29772 3.759 0.000533 ***
strike -0.26431 0.04539 -5.824 7.71e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01765 on 41 degrees of freedom
Multiple R-squared: 0.85, Adjusted R-squared: 0.8391
F-statistic: 77.47 on 3 and 41 DF, p-value: < 2.2e-16
> anova(baseball.lm)
Analysis of Variance Table
Response: batt_avg
Df Sum Sq Mean Sq F value Pr(>F)
runs 1 0.057976 0.057976 189.7528 < 2.2e-16 ***
doubles 1 0.003872 0.003872 12.6714 0.0009955 ***
triples 1 0.002173 0.002173 7.1111 0.0110942 *
home_runs 1 0.000024 0.000024 0.0776 0.7820426
strike 1 0.009231 0.009231 30.2116 2.589e-06 ***
Residuals 39 0.011916 0.000306
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(baseball.2)
Analysis of Variance Table
Response: batt_avg
Df Sum Sq Mean Sq F value Pr(>F)
runs 1 0.057976 0.057976 186.062 < 2.2e-16 ***
doubles 1 0.003872 0.003872 12.425 0.001057 **
strike 1 0.010568 0.010568 33.914 7.709e-07 ***
Residuals 41 0.012776 0.000312
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Source                                          df   SS         MS         F-value
home_runs + triples | runs + doubles + strike    2   0.000860   0.000430   1.41
runs + doubles + strike                          3   0.072416
Error                                           39   0.011916   0.000306
Total                                           44   0.085192
P-value = P(F2,39 ≥ 1.41) ≈ 0.26
Dropping triples and home_runs removes two coefficients, so the extra sum of squares
(0.012776 - 0.011916 = 0.000860) has 2 degrees of freedom. Since the p-value is greater than
0.05, we fail to reject the null hypothesis that the coefficients of triples and home_runs are
both zero. Thus, the reduced model with runs, doubles and strike is adequate.
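Because two predictors are dropped, the comparison of the full and reduced models has 2 numerator degrees of freedom. A minimal sketch of the partial F-test, using only the residual sums of squares and degrees of freedom reported in the two ANOVA tables above (the variable names are illustrative):

```r
# Residual sums of squares and degrees of freedom from the two ANOVA tables above.
rss_full    <- 0.011916; df_full    <- 39   # runs + doubles + triples + home_runs + strike
rss_reduced <- 0.012776; df_reduced <- 41   # runs + doubles + strike

df_extra <- df_reduced - df_full            # number of coefficients dropped
ss_extra <- rss_reduced - rss_full          # extra sum of squares they explain

f_stat  <- (ss_extra / df_extra) / (rss_full / df_full)
p_value <- pf(f_stat, df_extra, df_full, lower.tail = FALSE)

print(c(F = f_stat, df1 = df_extra, df2 = df_full, p = p_value))

# With the data loaded, anova(baseball.2, baseball.lm) carries out the same comparison.
```

Working from the rounded table values gives F of about 1.41 and a p-value of about 0.26, so the two extra predictors do not significantly improve the fit.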
e.
The Normal Q-Q plot of the residuals has slight curvature but is close to linear, implying that
the errors are close to normally distributed. The residuals vs fitted plot has no discernible
pattern. The residuals vs predictor plots also show no obvious pattern, so a linear model seems
adequate. Thus, multiple regression that encompasses all the variables will be used.
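The diagnostic plots described above can be produced along the following lines. This is a sketch on simulated data (the data frame sim and its values are hypothetical), not the commands used for the original figures.

```r
# Simulated stand-in data; every value is hypothetical.
set.seed(3)
n <- 45
sim <- data.frame(
  runs    = runif(n, 0.05, 0.25),
  doubles = runif(n, 0.02, 0.08),
  strike  = runif(n, 0.05, 0.30)
)
sim$batt_avg <- 0.17 + 0.57 * sim$runs + 1.1 * sim$doubles -
  0.26 * sim$strike + rnorm(n, sd = 0.018)

fit <- lm(batt_avg ~ runs + doubles + strike, data = sim)

par(mfrow = c(2, 3))
plot(fit, which = 1:2)   # residuals vs fitted, then normal Q-Q

# Residuals against each predictor; a visible trend would suggest a missing term.
for (v in c("runs", "doubles", "strike")) {
  plot(sim[[v]], residuals(fit), xlab = v, ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```

With the real data, the same calls applied to baseball.lm and the columns of baseball would give the plots discussed in this part.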
f.
> baseball.lm = lm(batt_avg ~ runs + doubles + triples + home_runs + strike, data = baseball)
> summary(baseball.lm)
Call:
lm(formula = batt_avg ~ runs + doubles + triples + home_runs +
strike, data = baseball)
Residuals:
Min 1Q Median 3Q Max
-0.03970 -0.01143 -0.00101 0.01044 0.03444
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18316 0.01714 10.685 3.79e-13 ***
runs 0.44668 0.10963 4.074 0.000219 ***
doubles 0.99090 0.31309 3.165 0.003005 **
triples 0.62160 0.58070 1.070 0.291004
home_runs 0.27374 0.16935 1.616 0.114060
strike -0.28456 0.05177 -5.497 2.59e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01748 on 39 degrees of freedom
Multiple R-squared: 0.8601, Adjusted R-squared: 0.8422
F-statistic: 47.96 on 5 and 39 DF, p-value: 1.261e-15
The adjusted R-squared of 0.8422 indicates that about 84% of the variability in batt_avg is
explained by the predictors after adjusting for the number of terms (multiple R-squared = 0.8601),
while the remainder is due to factors not in the model. The regression as a whole is statistically
significant (F = 47.96 on 5 and 39 DF, p < 0.05). The intercept is 0.183 and is statistically
significant. However, triples and home_runs are not statistically significant. Holding the other
predictors constant, a one-unit increase in runs or doubles is associated with an increase in
batt_avg of 0.45 and 0.99 units respectively, while a one-unit increase in strike is associated
with a decrease of 0.28 units.
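The link between the two R-squared figures in the summary can be verified by hand. A minimal sketch, where n = 45 is inferred from the 39 residual degrees of freedom plus 6 estimated coefficients:

```r
r2 <- 0.8601   # multiple R-squared from the summary above
n  <- 45       # observations: 39 residual df + 6 estimated coefficients
p  <- 5        # number of predictors

# Adjusted R-squared penalises the fit for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))   # 0.8422
```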
Question 3
a.
> View(incomes)
> par(mfrow=c(1,3))
> plot(incomes)
> income.1 = lm(Income ~ Rank, data = incomes)
> summary(income.1)
Call:
lm(formula = Income ~ Rank, data = incomes)
Residuals:
    Min      1Q  Median      3Q     Max
-164966  -58546  -16978   47696  203457
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   438835      11995   36.58   <2e-16 ***
Rank          -50354       2081  -24.20   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 72690 on 182 degrees of freedom
Multiple R-squared: 0.7629, Adjusted R-squared: 0.7616
F-statistic: 585.6 on 1 and 182 DF, p-value: < 2.2e-16
b.
> abline(income.1)
> plot(income.1,which=1:2)
The normal Q-Q plot of the residuals shows slight curvature (concave up), so the residuals are
not very close to normally distributed. The residuals vs fitted plot shows a positive quadratic
trend, suggesting a quadratic or higher-order fit. Finally, the scatterplot of the data does not
appear linear.
c.
> model1.lm = lm(Income ~ Rank, data = incomes)
> summary(model1.lm)
Call:
lm(formula = Income ~ Rank, data = incomes)
Residuals:
Min 1Q Median 3Q Max
-164966 -58546 -16978 47696 203457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 438835 11995 36.58 <2e-16 ***
Rank -50354 2081 -24.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 72690 on 182 degrees of freedom
Multiple R-squared: 0.7629, Adjusted R-squared: 0.7616
F-statistic: 585.6 on 1 and 182 DF, p-value: < 2.2e-16
> AIC(model1.lm)
[1] 4645.538
> model2.lm = lm(Income ~ Rank + I(Rank^2), data = incomes)
> summary(model2.lm)
Call:
lm(formula = Income ~ Rank + I(Rank^2), data = incomes)
Residuals:
Min 1Q Median 3Q Max
-127622 -22179 -132 27474 108581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 624270.1 10936.5 57.08 <2e-16 ***
Rank -150944.8 4910.0 -30.74 <2e-16 ***
I(Rank^2) 10031.2 476.6 21.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39260 on 181 degrees of freedom
Multiple R-squared: 0.9312, Adjusted R-squared: 0.9305
F-statistic: 1225 on 2 and 181 DF, p-value: < 2.2e-16
> AIC(model2.lm)
[1] 4419.833
> model3.lm = lm(Income ~ Rank + I(Rank^2) + I(Rank^3), data = incomes)
> summary(model3.lm)
Call:
lm(formula = Income ~ Rank + I(Rank^2) + I(Rank^3), data = incomes)
Residuals:
Min 1Q Median 3Q Max
-95920 -21821 1140 20127 80071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 743524.7 15458.3 48.099 <2e-16 ***
Rank -267348.4 12936.6 -20.666 <2e-16 ***
I(Rank^2) 37506.8 2928.3 12.808 <2e-16 ***
I(Rank^3) -1815.8 191.8 -9.467 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 32170 on 180 degrees of freedom
Multiple R-squared: 0.9541, Adjusted R-squared: 0.9533
F-statistic: 1247 on 3 and 180 DF, p-value: < 2.2e-16
> AIC(model3.lm)
[1] 4347.479
> model4.lm = lm(Income ~ Rank + I(Rank^2) + I(Rank^3) + I(Rank^4), data = incomes)
> summary(model4.lm)
Call:
lm(formula = Income ~ Rank + I(Rank^2) + I(Rank^3) + I(Rank^4),
data = incomes)
Residuals:
Min 1Q Median 3Q Max
-90150 -21165 -263 19727 72992
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 836754.91 26709.33 31.328 < 2e-16 ***
Rank -396016.21 33094.78 -11.966 < 2e-16 ***
I(Rank^2) 88544.34 12492.78 7.088 3.03e-11 ***
I(Rank^3) -9441.91 1828.37 -5.164 6.39e-07 ***
I(Rank^4) 380.81 90.84 4.192 4.34e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30780 on 179 degrees of freedom
Multiple R-squared: 0.9582, Adjusted R-squared: 0.9573
F-statistic: 1026 on 4 and 179 DF, p-value: < 2.2e-16
> AIC(model4.lm)
[1] 4332.246
Of the four candidate models, model 4 (the quartic) is the best because it has the lowest AIC:
4332.246, compared with 4347.479 for model 3, 4419.833 for model 2 and 4645.538 for model 1.
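The model-by-model comparison above can be written more compactly. The following is a sketch on simulated data: incomes_sim, its Rank range of 1 to 10 and its coefficients are made up, and only the sample size of 184 is taken from the output above. The real data may rank the models differently.

```r
# Simulated stand-in for the incomes data; every value is hypothetical.
set.seed(4)
n <- 184
incomes_sim <- data.frame(Rank = runif(n, 1, 10))
incomes_sim$Income <- 600000 - 150000 * incomes_sim$Rank +
  10000 * incomes_sim$Rank^2 + rnorm(n, sd = 30000)

# Build the nested polynomial models by adding one term at a time.
m1 <- lm(Income ~ Rank, data = incomes_sim)
m2 <- update(m1, . ~ . + I(Rank^2))
m3 <- update(m2, . ~ . + I(Rank^3))
m4 <- update(m3, . ~ . + I(Rank^4))

# AIC for each model; lower is better.
aic_values <- sapply(list(linear = m1, quadratic = m2, cubic = m3, quartic = m4), AIC)
print(round(aic_values, 1))
print(names(which.min(aic_values)))

# Nested models can also be compared with sequential F-tests.
print(anova(m1, m2, m3, m4))
```

AIC trades goodness of fit against the number of parameters, so a higher-order term lowers it only if the improvement in fit outweighs the penalty for the extra coefficient.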
d.
Income = 836754.91 - 396016.21R + 88544.34R^2 - 9441.91R^3 + 380.81R^4, where R is the Rank
Income = 128,050
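The reported prediction can be reproduced from the fitted coefficients. Substituting R = 5 into the quartic gives 128,049.86, which rounds to the figure above, so the sketch below assumes that Rank = 5 is the value the question asked about; the function name predict_income is illustrative.

```r
# Coefficients of model 4: intercept, Rank, Rank^2, Rank^3, Rank^4.
coefs <- c(836754.91, -396016.21, 88544.34, -9441.91, 380.81)

# Evaluate the fitted quartic at a given rank.
predict_income <- function(rank) sum(coefs * rank^(0:4))

print(round(predict_income(5)))   # 128050

# With the data loaded, predict(model4.lm, newdata = data.frame(Rank = 5))
# would give the same prediction from the unrounded coefficients.
```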
