Environmental Data Analysis - Assignment

Fit a full model with all variables, create diagnostic plots, identify problems with the fit, determine the unnecessary variable based on the model summary, and compare AIC for different models.


Added on  2022-09-01

Environmental Data Analysis - Assignment 11
Use the following for questions 1–6. In 1885, Francis Galton studied the relationship between the
height of adult children and their parents. This study is where the term “regression” comes from (see
Exercise 26 and Display 7.18 in the text). Galton measured the heights of 933 children from 205
families along with the heights of their parents. All measurements are in inches. Run the following
code to load the data:
require(Sleuth3)
data=ex0726
head(data)
1. Let’s start by fitting a full model with all of the variables and creating the diagnostic plots.
Show your R output for the summary of this model and the diagnostic plots (cropping out the
bottom two plots is OK).
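m1=lm(data$Height~data$Father+data$Mother+data$Family)
summary(m1)          # model summary requested in the question
dev.new()            # portable alternative to windows(), which exists only on Windows
par(mfrow=c(2,2))    # 2 x 2 grid for the four diagnostic plots
plot(m1)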
2. Are there any obvious problems with the fit of this model based on the diagnostic plots?
Explain.
The residuals are scattered randomly around the zero line, which supports the assumption of a
linear relationship. However, a few residuals stand out from the rest, suggesting possible
outliers in the data.
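One way to back up the visual impression of outliers is to list the observations with large standardized residuals. The sketch below is only an illustration: the helper name `flag_outliers` and the cutoff of 3 are assumptions, not part of the assignment, and the demonstration uses simulated data with one planted outlier rather than the Galton measurements. With the model above, the call would be `flag_outliers(m1)`.

```r
# Hypothetical helper (not part of the assignment): return the indices of
# observations whose standardized residual exceeds a chosen cutoff.
flag_outliers <- function(model, cutoff = 3) {
  r <- rstandard(model)            # residuals scaled to roughly unit variance
  unname(which(abs(r) > cutoff))
}

# Simulated data, not the Galton measurements
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
y[25] <- y[25] + 15                # plant one obvious outlier

fit <- lm(y ~ x)
flagged <- flag_outliers(fit)
print(flagged)
```

A flagged observation is not automatically an error; it is a candidate to inspect before deciding whether it affects the conclusions.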
3. Based on the model summary, one of the variables is not needed in the model. Which one?
Explain why it is not useful (HINT: consider the nature of the variable).
Call:
lm(formula = data$Height ~ data$Father + data$Mother + data$Family)

Residuals:
   Min     1Q Median     3Q    Max
-9.081 -2.736 -0.208  2.772 11.652

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.097649  12.466751   2.254   0.0244 *
data$Father  0.302389   0.149662   2.020   0.0436 *
data$Mother  0.281597   0.051742   5.442 6.73e-08 ***
data$Family -0.003065   0.006572  -0.466   0.6410
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.392 on 929 degrees of freedom
Multiple R-squared: 0.1055, Adjusted R-squared: 0.1026
F-statistic: 36.52 on 3 and 929 DF, p-value: < 2.2e-16
The model summary shows that the Family variable has a very high p-value of 0.641, so it is
not statistically significant. Moreover, Family is only a serial number that identifies each
family, not a measurement, so it is not needed and should be dropped.
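The decision to drop a single term can also be checked with a partial F-test that compares the model with and without it. The following is a minimal sketch on simulated data (the means, slopes, and noise level are made-up assumptions, not Galton's values), with `Family` generated as an arbitrary ID. For one numeric term, the F-test p-value should match the t-test p-value in the summary.

```r
# Simulated data, not the Galton measurements; all parameters are assumptions
set.seed(1)
n <- 200
sim <- data.frame(
  Father = rnorm(n, mean = 69, sd = 2.5),
  Mother = rnorm(n, mean = 64, sd = 2.3),
  Family = seq_len(n)              # arbitrary ID, carries no height information
)
sim$Height <- 20 + 0.4 * sim$Father + 0.3 * sim$Mother + rnorm(n, sd = 3)

full    <- lm(Height ~ Father + Mother + Family, data = sim)
reduced <- lm(Height ~ Father + Mother, data = sim)

# Partial F-test for the single dropped term
f_test <- anova(reduced, full)
p_f <- f_test[["Pr(>F)"]][2]

# t-test p-value for the same term from the full model summary
p_t <- summary(full)$coefficients["Family", "Pr(>|t|)"]

print(c(F_test = p_f, t_test = p_t))
```

Passing the data frame through `data = sim` instead of writing `sim$Height` in the formula also gives cleaner coefficient names in the summary.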
4. The following code fits three models: father and mother, father only, and mother only; and
gives the AIC for each model. Which model is the best one to fit the data? Explain your
reasoning.
m2=lm(data$Height~data$Father+data$Mother)
m3=lm(data$Height~data$Father)
m4=lm(data$Height~data$Mother)
cbind(AIC(m2,m3,m4))
   df      AIC
m2  4 4931.290
m3  3 4964.342
m4  3 4994.666
Among the three models, m2 (father and mother) has the lowest AIC, 4931.290, so it is the
best one to fit the data. A lower AIC indicates a better balance between goodness of fit and
model complexity.
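The comparison can be made explicit by computing each model's AIC difference from the best model. The sketch below copies the three AIC values from the output above. The second part uses R's built-in `cars` data, which is unrelated to this assignment, only to illustrate that `AIC()` equals -2 times the log-likelihood plus 2 times the number of estimated parameters. That count includes the residual variance, which is why m2 shows df = 4.

```r
# AIC values copied from the output above
aic <- c(m2 = 4931.290, m3 = 4964.342, m4 = 4994.666)

delta <- aic - min(aic)            # difference from the lowest-AIC model
best  <- names(which.min(aic))
print(round(delta, 3))
print(best)

# Illustration of the AIC formula on the built-in cars data (not Galton's data)
fit <- lm(dist ~ speed, data = cars)
ll  <- logLik(fit)
k   <- attr(ll, "df")              # intercept + slope + residual variance
manual_aic <- -2 * as.numeric(ll) + 2 * k
print(c(manual = manual_aic, builtin = AIC(fit)))
```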
5. What if we use only one parent to model the height of their child? Would it be a better model
to use only the father, only the mother, or should it be based on gender (i.e., use father’s
height with sons, and mother’s heights with daughters)? Run the following code which will