
Applied Statistics


Added on 2023/01/18

AI Summary

This document discusses applied statistics and linear regression models. It includes MATLAB code for plotting data and fitting linear models. The models are analyzed and compared, and residual plots are used to check the models. The document also provides insights into the interpretation of the models.


Running head: APPLIED STATISTICS
APPLIED STATISTICS
Name of the Student
Name of the University
Author Note

Question 1:
a) The raw data and the two linear models, one fitted from the first year up to 1960
(included) and the other from 1961 to the final year, are plotted by the following
MATLAB code.
MATLAB code:
% import the data: row 1 holds the years, row 2 the consumption values
a = importdata('courseworkdata.txt',',');
year = a.data(1,:);
avg_pcap_scon = a.data(2,:);
% split the series at 1960 (1960 belongs to the first segment)
early = year <= 1960;
year1 = year(early);
year2 = year(~early);
avg_pcap_scon1 = avg_pcap_scon(early);
avg_pcap_scon2 = avg_pcap_scon(~early);
data1 = table(year1',avg_pcap_scon1','VariableNames', ...
    {'year','average_per_capita_sugar_consumption'});
data2 = table(year2',avg_pcap_scon2','VariableNames', ...
    {'year','average_per_capita_sugar_consumption'});
% linear model till 1960
lm1 = fitlm(data1,'average_per_capita_sugar_consumption ~ 1 + year')
figure(1)
subplot(2,1,1)
scatter(year1,avg_pcap_scon1)
lsline
xlabel('year')
ylabel('Avg per capita sugar consumption')
legend('data points','least square line','Location','best')
% linear model from 1961
lm2 = fitlm(data2,'average_per_capita_sugar_consumption ~ 1 + year')
subplot(2,1,2)
scatter(year2,avg_pcap_scon2)
lsline
xlabel('year')
ylabel('Avg per capita sugar consumption')
legend('data points','least square line','Location','best')
Output:
lm1 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year

Estimated Coefficients:
                   Estimate    SE          tStat      pValue
                   ________    ________    _______    __________
    (Intercept)    -826.05     23.135      -35.706    4.361e-56
    year           0.4535      0.012174    37.252     1.0903e-57

Number of observations: 95, Error degrees of freedom: 93
Root Mean Squared Error: 3.77
R-squared: 0.937, Adjusted R-Squared 0.937
F-statistic vs. constant model: 1.39e+03, p-value = 1.09e-57

lm2 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year

Estimated Coefficients:
                   Estimate    SE          tStat      pValue
                   ________    ________    _______    __________
    (Intercept)    1080.7      94.324      11.457     4.4601e-16
    year           -0.52076    0.047433    -10.979    2.2634e-15

Number of observations: 56, Error degrees of freedom: 54
Root Mean Squared Error: 5.74
R-squared: 0.691, Adjusted R-Squared 0.685
F-statistic vs. constant model: 121, p-value = 2.26e-15
Plot:
[Figure 1: two scatter plots of Avg per capita sugar consumption against year, each with data points and the least square line. Top panel: years up to 1960. Bottom panel: years from 1961.]
b) The estimated regression models show the following.
The linear model up to 1960 is average_per_capita_sugar_consumption = -826.05 +
0.4535*year. The positive slope indicates that average per capita sugar consumption
increased up to 1960, by about 0.45 units per year. The R^2 value of the model is 0.937, so
93.7% of the variation in average sugar consumption is explained by year.
The linear model after 1960 up to 2020 is
average_per_capita_sugar_consumption = 1080.7 - 0.52076*year. The negative slope indicates
that average per capita sugar consumption decreased after 1960, by about 0.52 units per year.
The R^2 value of the model is 0.691, so 69.1% of the variation in average per capita sugar
consumption is explained by year.
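As a small check on the two fitted lines, they can be evaluated at the boundary between the segments. The sketch below only uses the rounded coefficients from the fitlm output above; the names line1, line2, pred1960 and pred1961 are introduced here for illustration.

```matlab
% Fitted lines, with coefficients copied from the fitlm output (rounded as displayed)
line1 = @(yr) -826.05 + 0.4535*yr;     % model till 1960
line2 = @(yr) 1080.7 - 0.52076*yr;     % model from 1961

pred1960 = line1(1960);   % last year of the first segment
pred1961 = line2(1961);   % first year of the second segment
fprintf('1960: %.2f, 1961: %.2f\n', pred1960, pred1961)
% prints: 1960: 62.81, 1961: 59.49
```

The two predictions differ by a few units at the changeover, since the two lines are fitted independently and are not constrained to meet at 1960.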
c)
Model checking is performed, and the models are improved by adding further predictor
variables. The improved models are compared with the original models using residual
plots.
MATLAB code:
%% Improved models
% add powers of year as extra predictor variables
data1 = table(year1',(year1.^2)',(year1.^3)',avg_pcap_scon1','VariableNames', ...
    {'year','yearsqr','yearcube','average_per_capita_sugar_consumption'});
data2 = table(year2',(year2.^2)',avg_pcap_scon2','VariableNames', ...
    {'year','yearsqr','average_per_capita_sugar_consumption'});
lm1 = fitlm(data1,'average_per_capita_sugar_consumption ~ 1 + year + yearsqr + yearcube')
lm2 = fitlm(data2,'average_per_capita_sugar_consumption ~ 1 + year + yearsqr')
figure(4)
% histogram of residuals
subplot(2,2,1)
plotResiduals(lm1)
% normal probability plot to check normality
subplot(2,2,2)
plotResiduals(lm1,'probability')
% residuals versus fitted values
subplot(2,2,3)
plotResiduals(lm1,'fitted')
% auto-correlation (via lagged residuals)
subplot(2,2,4)
plotResiduals(lm1,'lagged')
% figure-level title (sgtitle requires R2018b or later)
sgtitle('Residuals for Improved model till year 1960')
figure(5)
% histogram of residuals
subplot(2,2,1)
plotResiduals(lm2)
% normal probability plot to check normality
subplot(2,2,2)
plotResiduals(lm2,'probability')
% residuals versus fitted values
subplot(2,2,3)
plotResiduals(lm2,'fitted')
% auto-correlation (via lagged residuals)
subplot(2,2,4)
plotResiduals(lm2,'lagged')
sgtitle('Residuals for Improved model from year 1961')
Output:
Original models:
lm1 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year

Estimated Coefficients:
                   Estimate    SE          tStat      pValue
                   ________    ________    _______    __________
    (Intercept)    -826.05     23.135      -35.706    4.361e-56
    year           0.4535      0.012174    37.252     1.0903e-57

Number of observations: 95, Error degrees of freedom: 93
Root Mean Squared Error: 3.77
R-squared: 0.937, Adjusted R-Squared 0.937
F-statistic vs. constant model: 1.39e+03, p-value = 1.09e-57

lm2 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year

Estimated Coefficients:
                   Estimate    SE          tStat      pValue
                   ________    ________    _______    __________
    (Intercept)    1080.7      94.324      11.457     4.4601e-16
    year           -0.52076    0.047433    -10.979    2.2634e-15

Number of observations: 56, Error degrees of freedom: 54
Root Mean Squared Error: 5.74
R-squared: 0.691, Adjusted R-Squared 0.685
F-statistic vs. constant model: 121, p-value = 2.26e-15
Improved models:
lm1 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year + yearsqr + yearcube

Estimated Coefficients:
                   Estimate       SE            tStat      pValue
                   ___________    __________    _______    __________
    (Intercept)    0              0             NaN        NaN
    year           -4.2229        0.6927        -6.0964    2.5826e-08
    yearsqr        0.0042284      0.00072744    5.8127     8.9903e-08
    yearcube       -1.0502e-06    1.9094e-07    -5.5003    3.4505e-07

Number of observations: 95, Error degrees of freedom: 92
Root Mean Squared Error: 3.38
R-squared: 0.95, Adjusted R-Squared 0.949
F-statistic vs. constant model: 876, p-value = 1.31e-60

lm2 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year + yearsqr

Estimated Coefficients:
                   Estimate       SE           tStat        pValue
                   ___________    _________    _________    _______
    (Intercept)    -650.81        13099        -0.049684    0.96056
    year           1.2209         13.175       0.092664     0.92652
    yearsqr        -0.00043793    0.0033129    -0.13219     0.89534

Number of observations: 56, Error degrees of freedom: 53
Root Mean Squared Error: 5.79
R-squared: 0.691, Adjusted R-Squared 0.679
F-statistic vs. constant model: 59.2, p-value = 3.13e-14
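The zero intercept with NaN statistics in the cubic model, and the very large standard errors in the quadratic model, suggest that the raw powers of year are almost perfectly collinear over these ranges. A common remedy is to centre and scale the year before forming its powers. The sketch below illustrates the effect on a hypothetical synthetic series, not the coursework data; the names yr, y, z, Xraw and Xstd are assumptions made for this illustration.

```matlab
% Hypothetical synthetic series (NOT the coursework data)
yr = (1866:1960)';
y  = 5 + 0.45*(yr - 1866) + 2*sin(yr/7);

% Cubic design matrices built from raw and from centred, scaled years
Xraw = [ones(size(yr)) yr yr.^2 yr.^3];
z    = (yr - mean(yr)) / std(yr);
Xstd = [ones(size(z)) z z.^2 z.^3];

% Condition numbers: a very large value signals near rank deficiency
fprintf('cond(raw)    = %.3g\n', cond(Xraw))
fprintf('cond(scaled) = %.3g\n', cond(Xstd))

% polyfit with three outputs fits the cubic in the centred, scaled variable
[p, S, mu] = polyfit(yr, y, 3);
yhat = polyval(p, yr, [], mu);   % fitted values on the original year axis
```

Centring and scaling changes the coefficients but not the fitted values, so the comparison of R^2 values between models is unaffected.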
Plots:
[Figures: four 2-by-2 residual panels, titled "Residuals for original model till year 1960", "Residuals for Improved model till year 1960", "Residuals for original model from year 1961" and "Residuals for Improved model from year 1961". Each figure contains a histogram of residuals, a normal probability plot of residuals, a plot of residuals vs. fitted values and a plot of residuals vs. lagged residuals.]
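For the model up to 1960, the adjusted R^2 of the improved model is 0.949, which is higher
than the adjusted R^2 of 0.937 of the original model, and the root mean squared error falls
from 3.77 to 3.38. Hence, the model is improved by extending the model equation with year^2
and year^3 terms. For the model from 1961, adding the year^2 term gives no improvement:
R^2 stays at 0.691 and the adjusted R^2 falls from 0.685 to 0.679.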
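The adjusted R^2 values used in this comparison can be reproduced from R^2, the number of observations and the error degrees of freedom reported in the output. The following is a minimal sketch using the usual formula for a model with an intercept; the helper name adjR2 is introduced here, and the R^2 inputs are the rounded values printed by fitlm.

```matlab
% Adjusted R^2 from R^2, sample size n and error degrees of freedom dfe
adjR2 = @(R2, n, dfe) 1 - (1 - R2) .* (n - 1) ./ dfe;

% Inputs read from the fitlm output (R^2 rounded as displayed)
a1_cubic = adjR2(0.95,  95, 92);   % improved model till 1960
a2_lin   = adjR2(0.691, 56, 54);   % original model from 1961
a2_quad  = adjR2(0.691, 56, 53);   % improved model from 1961
fprintf('%.3f  %.3f  %.3f\n', a1_cubic, a2_lin, a2_quad)
% prints: 0.949  0.685  0.679
```

With R^2 unchanged at 0.691, the extra year^2 term in the second model only costs a degree of freedom, which is why its adjusted R^2 falls.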
d) The improved model up to 1960 (included) is given by the following equation.
average consumed sugar per capita = -4.2229*year + 0.00423*(year^2) - 1.05*10^(-6)*(year^3)
Using this model, the year in which sugar was first consumed is taken to be the year in which
the predicted average per capita consumption first reaches 1. The prediction is plotted with
the following MATLAB code.

MATLAB code:
year = 1000:1960;
avg_sugar = -4.2229.*(year) + 0.00423.*(year.^2) - 1.05e-6.*(year.^3);
plot(year,avg_sugar)
title('Average sugar consumption per capita')
xlabel('Year')
grid on
[Figure: "Average sugar consumption per capita" plotted against Year for 1000 to 2000, with the vertical axis running from -1200 to 200.]
Hence, the curve shows that the predicted average per capita consumption of sugar reaches
approximately 1 shortly after the year 1800. The model therefore predicts that sugar was
first consumed at the start of the 19th century.
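Instead of reading the year off the curve, the crossing can be located numerically. The sketch below solves for the year at which the cubic reaches 1, using the rounded coefficients quoted in the model equation; the names avg_sugar_fun and first_year and the search bracket [1800 1850] are assumptions made here. Because the three terms nearly cancel, the result is sensitive to the rounding of the coefficients and should be treated as approximate.

```matlab
% Cubic model till 1960, with the rounded coefficients quoted in the text
avg_sugar_fun = @(yr) -4.2229*yr + 0.00423*yr.^2 - 1.05e-6*yr.^3;

% Solve avg_sugar_fun(yr) = 1 inside a bracket where the sign changes
first_year = fzero(@(yr) avg_sugar_fun(yr) - 1, [1800 1850]);
fprintf('Predicted consumption reaches 1 in about %.0f\n', first_year)
```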

e) Historical evidence indicates that crystallised sugar was first consumed in Great Britain in
the 12th century. The discrepancy between the model prediction and the historical evidence
arises because the data set used contains consumption data only for the years 1850 to
1960. When the model is extrapolated beyond this range, extrapolation error occurs, because
the consumption behaviour in the unobserved years played no part in the least-squares
fit.
Question 2:
a) A single linear model is now fitted to the entire data set, instead of two separate models for
the two sections of the data as before. The additional variable constructed is year^2, which is
added to the model as a second predictor.
MATLAB code:
a = importdata('courseworkdata.txt',',');
year = a.data(1,:);
yearsqr = year.^2;
avg_pcap_scon = a.data(2,:);
data1 = table(year',yearsqr',avg_pcap_scon','VariableNames', ...
    {'year','yearsqr','average_per_capita_sugar_consumption'});
lm1 = fitlm(data1,'average_per_capita_sugar_consumption ~ 1 + year + yearsqr')
figure(1)
scatter3(year,yearsqr,avg_pcap_scon)
xlabel('Year')
ylabel('Year square')
zlabel('Average per capita sugar consumption')
hold on
% first principal axis of the centred point cloud
xyz = [year' yearsqr' avg_pcap_scon'];
r0 = mean(xyz);
xyz = bsxfun(@minus,xyz,r0);
[~,~,V] = svd(xyz,0);
% end points of the axis, spanning the range of the data
t = [min(xyz*V(:,1)) max(xyz*V(:,1))];
xfit = r0(1)+t*V(1,1);
yfit = r0(2)+t*V(2,1);
zfit = r0(3)+t*V(3,1);
plot3(xfit,yfit,zfit,'r-')
hold off
figure(2)
% histogram of residuals
subplot(2,2,1)
plotResiduals(lm1)
% normal probability plot to check normality
subplot(2,2,2)
plotResiduals(lm1,'probability')
% residuals versus fitted values
subplot(2,2,3)
plotResiduals(lm1,'fitted')
% auto-correlation (via lagged residuals)
subplot(2,2,4)
plotResiduals(lm1,'lagged')
Output:
lm1 =
Linear regression model:
    average_per_capita_sugar_consumption ~ 1 + year + yearsqr

Estimated Coefficients:
                   Estimate      SE           tStat      pValue
                   __________    _________    _______    __________
    (Intercept)    -18257        773.89       -23.591    5.2655e-52
    year           18.772        0.8011       23.433     1.1538e-51
    yearsqr        -0.0048117    0.0002072    -23.222    3.3071e-51

Number of observations: 151, Error degrees of freedom: 148
Root Mean Squared Error: 5.24
R-squared: 0.865, Adjusted R-Squared 0.863
F-statistic vs. constant model: 472, p-value = 5.62e-65
Plot:
[Figure 1: 3-D scatter plot of Average per capita sugar consumption against Year and Year square.]
[Figure 2: 2-by-2 residual panel containing a histogram of residuals, a normal probability plot of residuals, a plot of residuals vs. fitted values and a plot of residuals vs. lagged residuals.]
The estimated parameters of this model are the intercept β0 = -18257, the coefficient of year
β1 = 18.772 and the coefficient of yearsqr β2 = -0.0048117.
The model is
Average per capita sugar consumption = -18257 + 18.772*year - 0.0048117*(year^2)
This model differs from the models in Question 1a: there, each model uses a single predictor
variable (year) with one slope parameter, whereas here two predictor variables, year and
year^2, are used with two slope parameters.
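Since the coefficient of year^2 is negative, the fitted quadratic has a single maximum, and its location indicates where the single model places the change from rising to falling consumption. A minimal sketch using the rounded coefficients from the output follows; the names b1, b2 and peak_year are introduced here for illustration.

```matlab
% Coefficients copied from the fitlm output (rounded as displayed)
b1 = 18.772;        % coefficient of year
b2 = -0.0048117;    % coefficient of yearsqr

% A quadratic b0 + b1*x + b2*x^2 with b2 < 0 peaks at x = -b1/(2*b2)
peak_year = -b1 / (2*b2);
fprintf('Fitted curve peaks near %.1f\n', peak_year)
```

The peak falls in the early 1950s, a few years before the 1960 breakpoint used for the two separate models in Question 1.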
b) The model differs from the earlier linear models in that the squared term year^2 is added
as a predictor variable; the model remains linear in its parameters. The adjusted R^2 is
0.863, so 86.3% of the variation in the dependent variable is explained by the predictors.
This is lower than the R^2 of the first model fitted up to 1960 (0.937) and higher than the
R^2 of the second model fitted after 1960 (0.691).

Question 3:
a) The model fitted form is given by
Yi ~ N(β0 + β1xi,σ).
Given,
P(β0 = -826) = 1
β1 ~ U(0,1) that is it is uniformly distributed over the interval [0,1].
P(σ = 3.1) = 1
MATLAB code:
a = importdata('courseworkdata.txt',',');
year = a.data(1,:)';
avg_pcap_scon = a.data(2,:);
year = (year - min(year))./(max(year) - min(year)); % normalizing the year data to [0,1]
beta1 = fitdist(year,'Beta'); % fit a Beta distribution to the normalized years
vals = pdf(beta1,year); % evaluate the fitted density at the normalized years
figure
plot(year,vals)
xlabel('Year')
ylabel('Probabilities')
title('PDF of beta')

% creating the mixed normally distributed variable
% Yi ~ N(beta0 + beta1*xi,sigma)
mu1 = [-826 -826];
sigma1 = [3.1 0;0 3.1];
mu2 = [mean(year) mean(year)];
sigma2 = [std(year) 0;0 std(year)];
r1 = mvnrnd(mu1,sigma1,length(year));
r2 = mvnrnd(mu2,sigma2,length(year));
X = [r1;r2];
gm = fitgmdist(X,2);
[logl,param] = proflik(beta1,2,'Display','on'); % log-likelihood values calculation and plot
postbeta = posterior(gm,X);
figure
scatter(X(:,1),X(:,2),10,postbeta(:,2))
ylabel(colorbar,'Posterior Probability of beta')
Output:

[Figure: "PDF of beta", Probabilities plotted against normalized Year on [0,1].]
[Figure: profile log likelihood against the parameter b, showing the estimate, the exact log likelihood, the Wald approximation and the 95% confidence level.]
[Figure: scatter plot of the simulated mixture data, coloured by "Posterior Probability of beta" on a scale from 0 to 1.]
b) The parameter β1 is the slope of the regression equation Yi ~ N(β0 + β1xi,σ). The models
fitted in Question 1a and 1b are two separate linear models, one for each segment of the year
data, each giving a single point estimate of the slope. The model fitted in Question 3 instead
describes the slope by a posterior distribution, represented here by a mixture of two normally
distributed variables with known standard deviations, with β0 and σ held fixed. The prior for
the slope β1 is the uniform distribution on [0,1].
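One way to read the specification in part a) is that, with β0 and σ fixed and a uniform prior on [0,1], the posterior of β1 is proportional to the Gaussian likelihood restricted to [0,1]. The sketch below evaluates that posterior on a grid. It is an illustration under stated assumptions and not the code used above: the data are hypothetical values simulated from the model with an assumed slope of 0.45, and the names x, y, b1, post, b1_mode and b1_mean are introduced here.

```matlab
% Hypothetical data simulated from the stated model (NOT the coursework data)
rng(1);
x = (1866:1960)';
beta0 = -826;     % fixed, as given in part a)
sigma = 3.1;      % fixed, as given in part a)
y = beta0 + 0.45*x + sigma*randn(size(x));   % 0.45 is an assumed true slope

% Sums needed for the residual sum of squares as a function of the slope
r   = y - beta0;
Srr = sum(r.^2);
Srx = sum(r.*x);
Sxx = sum(x.^2);

% Grid over the support of the U(0,1) prior
b1 = linspace(0, 1, 200001);
% Gaussian log-likelihood up to an additive constant
logL = -(Srr - 2*b1*Srx + b1.^2*Sxx) / (2*sigma^2);

% Flat prior: the posterior is the likelihood normalized over [0,1]
post = exp(logL - max(logL));
post = post / trapz(b1, post);

[~, k]  = max(post);
b1_mode = b1(k);                      % posterior mode
b1_mean = trapz(b1, b1 .* post);      % posterior mean
fprintf('posterior mode %.4f, posterior mean %.4f\n', b1_mode, b1_mean)
```

Because the prior is flat and the likelihood is concentrated well inside [0,1], the posterior mode coincides with the least-squares slope for the fixed intercept, which gives a direct point of comparison with the slope estimated in Question 1.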
