Data Analysis and Optimization: Linear Regression, Clustering, MATLAB

Running head: DATA ANALYSIS AND OPTIMIZATION
DATA ANALYSIS AND OPTIMIZATION
Name of the Student
Name of the University
Author Note
Problem 1:
1. The fitlm function is used in MATLAB to model the residuary resistance per unit weight of displacement as a function of the other variables in the given data.
MATLAB code:
opt = detectImportOptions('yacht.txt');
opt.VariableNames = {'Resistance','Long','Prismcoeff','LDratio','BDratio','LBratio','Frnum'};
hulldata = readtable('yacht.txt',opt); % import the yacht hull data with named variables
linmod = fitlm(hulldata,'Resistance ~ Long + Prismcoeff + LDratio + BDratio + LBratio + Frnum')
Output:
Linear regression model:
Resistance ~ 1 + Long + Prismcoeff + LDratio + BDratio + LBratio + Frnum
Estimated Coefficients:
                Estimate    SE        tStat      pValue
(Intercept)     -19.237     27.113    -0.70949   0.47857
Long            0.19384     0.33807   0.57338    0.56681
Prismcoeff      -6.4194     44.159    -0.14537   0.88452
LDratio         4.233       14.165    0.29883    0.76527
BDratio         -1.7657     5.5212    -0.3198    0.74934
LBratio         -4.5164     14.2      -0.31806   0.75066
Frnum           121.67      5.0658    24.018     6.21E-72
Number of observations: 308, Error degrees of freedom: 301
Root Mean Squared Error: 8.96
R-squared: 0.658, Adjusted R-Squared 0.651
F-statistic vs. constant model: 96.3, p-value = 4.53e-67
The overall p-value of the model is very close to zero, which makes the model significant. However, the adjusted R^2 of the model is not satisfactory, as only 65.1% of the variation in the dependent variable, resistance per unit weight of displacement, is explained by its predictors. Hence, the model cannot be relied on for accurate predictions within its regression range.
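These diagnostics can also be read programmatically from the fitted model object. A minimal sketch, assuming the linmod variable produced by the code above:
fprintf('Adjusted R^2: %.3f \n',linmod.Rsquared.Adjusted) % 0.651, as reported above
fprintf('RMSE: %.2f \n',linmod.RMSE) % 8.96, as reported above
fprintf('F-test p-value vs. constant model: %.3g \n',coefTest(linmod))
linmod.Coefficients(:,{'pValue'}) % per-predictor significance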
2. a) Now, the whole data is divided in an 80%:20% ratio, where 80% is used for training and the trained model is tested on the remaining 20% of the data.
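Note that the code below uses the first 80% of the rows as the training set. As a hedged aside (not part of the original solution), a randomized hold-out split via cvpartition would guard against any ordering in the rows:
rng(1) % fix the seed for reproducibility (illustrative choice)
cv = cvpartition(height(hulldata),'HoldOut',0.2);
traindata = hulldata(training(cv),:); % random 80% for training
testdata = hulldata(test(cv),:); % remaining 20% for testing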
MATLAB code:
trainlen = round(0.8*length(hulldata.Resistance));
trainmodel = fitlm(hulldata(1:trainlen,:),'Resistance ~ Frnum')
% testing with trained model
Frnum = hulldata{trainlen + 1:length(hulldata.Frnum),{'Frnum'}};
resistact = hulldata{trainlen + 1:length(hulldata.Resistance),{'Resistance'}};
resisttrain = -24.14 + 120.36.*Frnum; % predictions using the coefficients from trainmodel
plot(Frnum,resistact,'b-',Frnum,resisttrain,'k:')
title('testing of trained model')
legend('Actual data','Predicted data by trained model')
xlabel('Froude number')
ylabel('Residuary resistance per unit weight of displacement')
grid on
Output:
trainmodel =
Linear regression model:
Resistance ~ 1 + Frnum
Estimated Coefficients:
                Estimate    SE        tStat      pValue
(Intercept)     -24.14      1.6913    -14.273    5.18E-34
Frnum           120.36      5.5952    21.512     2.82E-58
Number of observations: 246, Error degrees of freedom: 244
Root Mean Squared Error: 8.82
R-squared: 0.655, Adjusted R-Squared 0.653
F-statistic vs. constant model: 463, p-value = 2.82e-58
[Figure: testing of trained model. Actual data (solid) and predicted data by the trained model (dotted) plotted against Froude number (x-axis, 0.1 to 0.45); y-axis: residuary resistance per unit weight of displacement (-10 to 70).]
The above results show that the trained model is not an appropriate fit for the actual resistance data: the residuary resistance pattern is somewhat non-linear, and hence it is not properly explained by the least-squares trained linear model.
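One way to accommodate this curvature, sketched here as an illustration rather than as part of the original solution, is to add polynomial terms in the Froude number to the fitlm formula:
polymodel = fitlm(hulldata(1:trainlen,:),'Resistance ~ Frnum + Frnum^2 + Frnum^3'); % hypothetical cubic trend in Frnum
polymodel.Rsquared.Adjusted % check whether this improves on the linear fit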
b) Not all the variables are important for the model, as the p-values show that not all predictors are significant. At the chosen significance level of 0.05, only the Froude number is significant, with a p-value below 0.05. Hence, all variables except the Froude number are removed as predictors in the training model, and the model's coefficient of determination improved to 65.3%.
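The same pruning can be done automatically. A hedged sketch using stepwiselm (not the code used in this solution), reusing hulldata and trainlen from above:
autopruned = stepwiselm(hulldata(1:trainlen,:),'constant','ResponseVar','Resistance', ...
'Upper','linear','PEnter',0.05,'PRemove',0.10) % only terms passing the 0.05 threshold enter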
Problem 2:
1. The noBow data is loaded, and only the variables that are expected to correlate closely with the target variable CommentsInNext24 are kept as predictors. Ridge and lasso regression are then applied, and the best parameter value for each is selected as the one giving the minimum MSE.
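For context, the two penalty parameters being tuned enter through the standard objectives (standard textbook/MATLAB definitions, not stated in the original):

$\hat{\beta}_{ridge} = \arg\min_\beta \|y - X\beta\|_2^2 + k\|\beta\|_2^2, \qquad \hat{\beta}_{lasso} = \arg\min_\beta \tfrac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$

Here k is the ridge parameter scanned over 0:2:50 in the code below, and lambda is the lasso penalty chosen by 5-fold cross-validation.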
MATLAB code:
clc
clear
nobowtrain = readtable('blogDatanoBowtrain.csv','ReadVariableNames',1);
nobowtest = readtable('blogDatanoBowtest.csv','ReadVariableNames',1);
X = [nobowtrain{:,17:24}]; % training predictor data (columns 17 to 24)
y = nobowtrain.CommentsInNext24; % training response data
X_test = [nobowtest{:,17:24}];
ytest = nobowtest.CommentsInNext24;
%% Ridge regression
kvals = 0:2:50; % candidate ridge parameters
b = ridge(y,X,kvals,0); % one column of coefficients (intercept first) per k
errors = zeros(length(kvals),1);
for i=1:length(kvals)
yhat = b(1,i) + X*b(2:end,i);
errors(i) = mean((yhat(~isnan(y)) - y(~isnan(y))).^2); % training MSE for k = kvals(i)
end
[minval, minindex] = min(errors); % finding index for best value of k in ridge regression
fprintf('The best value of k for Ridge regression which is selected is %.4f \n',kvals(minindex))
b_best = ridge(y,X,kvals(minindex),0);
yhat = b_best(1) + X_test*b_best(2:end);
%% Lasso regression
[B,FitInfo] = lasso(X,y,'CV',5);
% select the lambda value for lasso that minimizes the cross-validated MSE
minMSEModel = B(:,FitInfo.IndexMinMSE);
bestlambdaval = FitInfo.LambdaMinMSE;
fprintf('The best lambda value which is selected for Lasso model is %.4f \n',bestlambdaval)
coef0 = FitInfo.Intercept(FitInfo.IndexMinMSE); % intercept for the selected lambda
yhat = X_test*minMSEModel + coef0;
Output:
The best value of k for Ridge regression which is selected is 3.8883
The best lambda value which is selected for Lasso model is 1.5814
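The script above forms test-set predictions for both models but does not score them. As a hedged follow-up (not in the original), the held-out MSE of each model can be reported, reusing b_best, minMSEModel, coef0, X_test and ytest from above:
yhat_ridge = b_best(1) + X_test*b_best(2:end);
yhat_lasso = X_test*minMSEModel + coef0;
fprintf('Test MSE for Ridge regression: %.4f \n',mean((ytest - yhat_ridge).^2))
fprintf('Test MSE for Lasso regression: %.4f \n',mean((ytest - yhat_lasso).^2))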
2 and 3:
Now, the bow data is accessed, and the 200 bag-of-words (bow) variables are summed to form a single predictor variable, while all the other predictors remain the same.
MATLAB code:
clc
clear
bowtrain = readtable('blogDatatrain.csv','ReadVariableNames',1);
bowtest = readtable('blogDatatest.csv','ReadVariableNames',1);
bow200avg = sum(bowtrain{:,27:226},2); % sum of the 200 bag-of-words columns into one aggregate predictor
X = [bowtrain{:,17:24} bow200avg]; % training predictor data
y = bowtrain.CommentsInNext24; % training response data
bow200testavg = sum(bowtest{:,27:226},2);
X_test = [bowtest{:,17:24} bow200testavg];
ytest = bowtest.CommentsInNext24;
%% linear regression
linmodel = fitlm(X,y)
% use only the 2nd variable, Comments24BeforeBasetime, as the predictor, since it is the only significant one
ypred = 6.8929 + 0.46591.*X_test(:,2); % intercept and slope taken from linmodel above
fprintf('The adjusted R^2 value for linear regression is %.4f \n',linmodel.Rsquared.Adjusted)
figure(1)
plot(X_test(:,2),ypred,'b-',X_test(:,2),ytest,'ro')
title('Comparison by linear model')
legend('Predicted number of comments in next 24 hours','Actual comments in next 24 hours','Location','best')
%% Ridge regression
kvals = 0:2:50; % candidate ridge parameters
b = ridge(y,X,kvals,0);
errors = zeros(length(kvals),1);
for i=1:length(kvals)
yhat = b(1,i) + X*b(2:end,i);
errors(i) = mean((yhat(~isnan(y)) - y(~isnan(y))).^2); % training MSE for k = kvals(i)
end
[minval, minindex] = min(errors); % finding index for best value of k in ridge regression
fprintf('The minimum MSE in Ridge regression is %.4f \n',minval)
fprintf('The best value of k for Ridge regression is %.4f \n',kvals(minindex))
b_best = ridge(y,X,kvals(minindex),0);
yhat = b_best(1) + X_test*b_best(2:end);
figure(2)
scatter(ytest,yhat)
hold on
plot(ytest,ytest)
title('Comparison of Actual data and fitted model by Ridge regression')
xlabel('Actual comments in next 24 hours')
ylabel('number of comments in next 24 by Predicted model')
hold off
%% Lasso regression
[B,FitInfo] = lasso(X,y,'CV',5);
% select the lambda value for lasso that minimizes the cross-validated MSE
minMSEModel = B(:,FitInfo.IndexMinMSE);
bestlambdaval = FitInfo.LambdaMinMSE;
fprintf('The best lambda value which is selected for Lasso model is %.4f \n',bestlambdaval)
coef0 = FitInfo.Intercept(FitInfo.IndexMinMSE); % intercept for the selected lambda
yhat = X_test*minMSEModel + coef0;
figure(3)
scatter(ytest,yhat)
title('Comparison by Lasso regression')
hold on
plot(ytest,ytest)
xlabel('Actual comments in next 24 hours')
ylabel('Predicted data')
hold off
Output:
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Estimated Coefficients:
                Estimate     SE        tStat       pValue
(Intercept)     6.8929       4.9193    1.4012      0.16176
x1              0.078727     0.15485   0.50842     0.61138
x2              0.46591      0.11239   4.1454      3.97E-05
x3              0.024591     0.11183   0.2199      0.82604
x4              -0.13485     0.17828   -0.75636    0.44978
x5              -8.8323      17.843    -0.49501    0.6208
x6              -0.4721      9.2213    -0.051196   0.95919
x7              -0.29799     8.2096    -0.036298   0.97106
x8              10.035       19.892    0.50448     0.61414
x9              -0.30524     0.34399   -0.88734    0.37531
Number of observations: 524, Error degrees of freedom: 514
Root Mean Squared Error: 75.1
R-squared: 0.0935, Adjusted R-Squared 0.0776
F-statistic vs. constant model: 5.89, p-value = 7.69e-08
The adjusted R^2 value for linear regression is 0.0776
The minimum MSE in Ridge regression is 5538.1588
The best value of k for Ridge regression is 6.8929
The best lambda value which is selected for Lasso model is 1.3129
[Figure 1: Comparison by linear model. Predicted number of comments in next 24 hours (line) and actual comments in next 24 hours (markers) plotted against Comments24BeforeBasetime; x-axis 0 to 400, y-axis 0 to 300.]
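As a hedged follow-up (not part of the original script), the three bow-data models can be compared on a common footing by scoring each on the held-out set, reusing ypred, b_best, minMSEModel, coef0, X_test and ytest from above:
mse_lin = mean((ytest - ypred).^2); % linear model using x2 only
mse_ridge = mean((ytest - (b_best(1) + X_test*b_best(2:end))).^2);
mse_lasso = mean((ytest - (X_test*minMSEModel + coef0)).^2);
fprintf('Test MSE -- linear: %.1f, ridge: %.1f, lasso: %.1f \n',mse_lin,mse_ridge,mse_lasso)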