SIT718 Real World Analytics Assessment Task 3: Problem Solving

Added on 2025/05/02

Using aggregation functions for data analysis
1. Understanding data
(i) See the zipped document.
(ii) See the R code
(iii) See the R code
Scatter plots:
Figure 1: Energy use of appliances (in Wh) versus Temperature in kitchen area (in Celsius)
In this graph, we see a positive relationship between the energy use of appliances and the
temperature in the kitchen. In other words, there is a positive correlation between X1 and the
response variable Y.
Figure 2: Energy use of appliances (in Wh) versus Humidity in kitchen area (in %)
Here, we see that there is a weak positive relationship between the response variable Y and the
humidity in the kitchen area.
Figure 3: Energy use of appliances (in Wh) versus Temperature outside (from weather station), in Celsius

The scatter plot reveals a positive relationship between the response variable and the
temperature outside (from the weather station). Note also the outlying values of the response
variable when the temperature is greater than 3.75.
Figure 4: Energy use of appliances (in Wh) versus Humidity outside (from weather station), given as a
percentage
From the scatter plot, it is hard to see any relationship between the response variable and the
humidity outside (from the weather station).
Figure 5: Energy use of appliances (in Wh) versus Visibility (from weather station), in km
The scatter plot of the response variable against the visibility from the weather station shows
a weak relationship between the two variables.
In summary, to clearly see the relationship between the independent and the response
variables, we need to compute the correlation matrix (see the table below).
       X1      X2      X3      X4      X5      Y
X1    1.00    1.00    0.07   -0.07    0.27    0.33
X2    1.00    1.00    0.07   -0.07    0.27    0.33
X3    0.07    0.07    1.00   -0.48    0.15    0.43
X4   -0.07   -0.07   -0.48    1.00   -0.08   -0.02
X5    0.27    0.27    0.15   -0.08    1.00    0.34
Y     0.33    0.33    0.43   -0.02    0.34    1.00
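The R code that produced this matrix is not reproduced here. As a rough illustration of what each entry means, the following Python sketch computes Pearson correlations from scratch. The column values are hypothetical toy numbers, not the assignment's data, and the helper names `pearson` and `correlation_matrix` are introduced only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns maps a variable name to its list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data (assumption), NOT the assignment's data set.
data = {
    "X3": [2.0, 3.5, 5.0, 6.5, 8.0],
    "X4": [90.0, 85.0, 88.0, 70.0, 75.0],
    "Y": [40.0, 50.0, 50.0, 70.0, 60.0],
}

corr = correlation_matrix(data)
for a in data:
    print(a, " ".join(f"{corr[a][b]:6.2f}" for b in data))
```

Each diagonal entry is 1 by construction and the matrix is symmetric, which matches the layout of the table above.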
Histograms:

Figure 6: Temperature in kitchen area, in Celsius
The histogram of the temperature in the kitchen area suggests that the variable is not normally
distributed, and it gives no clear indication of skewness. The Shapiro-Wilk test applied to
this variable rejects normality (p-value < 0.05).
Figure 7: Humidity in kitchen area, given as a percentage
The histogram of the humidity in the kitchen suggests that the data are normally distributed around
the mean. The p-value of the normality test is greater than 0.05, so normality of the variable is
not rejected.
Figure 8: Temperature outside (from weather station), in Celsius
The shape of the histogram reveals that the temperature outside (from the weather station) is
normally distributed. Here also, the p-value of the normality test is greater than 0.05.

Figure 9: Humidity outside (from weather station), given as a percentage
The data seem to be normally distributed, but the Shapiro test shows that they are not (p-value
< 0.05).
Figure 10: Visibility (from weather station), in km
We notice that the visibility (in km), from the weather station, is not normally distributed (p-value
< 0.05) but is right-skewed.
Figure 11: Y: Energy use of appliances, in Wh
The response variable is right-skewed. The normality test confirms that Y is not normally
distributed.
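The skewness statements above are read off the histograms. A numeric check is possible with the moment coefficient of skewness, sketched below in Python. This is only an illustration: the sample values are hypothetical, the function name `skewness` is an assumption, and the Shapiro-Wilk test itself is not reimplemented here.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical energy readings in Wh (assumption), with a long right tail.
sample = [30, 40, 40, 50, 50, 50, 60, 60, 70, 110, 250, 400]

g1 = skewness(sample)
print("skewness:", round(g1, 3))
print("right-skewed" if g1 > 0 else "left-skewed or symmetric")
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.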
2. Transform the data
(i) Five variables:

Figure 12: Temperature outside
Figure 13: Humidity outside
Figure 14: Visibility in km
Figure 15: Energy use of appliances
(ii) Description of the four selected variables and the response variable

Figure 16: Temperature outside (from weather station), in Celsius
The shape of the histogram reveals that the temperature outside (from the weather station) is
normally distributed. Here also, the p-value of the normality test is greater than 0.05.
Figure 17: Humidity outside (from weather station), given as a percentage
The data seem to be normally distributed, but the Shapiro test shows that they are not (p-value
< 0.05).
Figure 18: Visibility (from weather station), in km
We notice that the visibility (in km), from the weather station, is not normally distributed (p-value
< 0.05) but is right-skewed (Cohen, 2015).

Figure 19: Y: Energy use of appliances, in Wh
The response variable is right-skewed. The normality test confirms that Y is not normally
distributed.
(iii) Because the data set contains some negative values, the transformation
applied to the selected data is normalization, as coded in the R
script. The normalization rescales the data so that they lie in [0,1].
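The R script is not shown, so the exact normalization is not known. One common way to map data that include negative values into [0,1] is min-max scaling, sketched below in Python under that assumption. The function name `min_max_normalise` and the temperature values are hypothetical.

```python
def min_max_normalise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant column")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical outside temperatures in Celsius (assumption), some negative.
temps = [-5.0, -1.2, 0.0, 3.75, 10.4]

scaled = min_max_normalise(temps)
print([round(s, 3) for s in scaled])
```

If this scaling is used, new test data would need to be rescaled with the minimum and maximum of the training data rather than with their own.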
3. Build models and investigate the importance of each variable
(i) See the working directory
(ii) See the R script
(iii) The tables
                      WAM      WPM05    WPM2     OWA      Choquet
RMSE                  0.43913  0.43964  0.43904  0.43935  0.43853
Average L1 error      0.32175  0.32160  0.32208  0.32169  0.32146
Pearson correlation   0.51359  0.52894  0.50157  0.47410  0.44302
Spearman correlation  0.48448  0.49539  0.47703  0.46805  0.46146
     WAM      WPM05    WPM2     OWA      Choquet
X1   0.62073  0.59360  0.60432  0.09564  0.59237
X3   0.25276  0.27047  0.23892  0.30705  0.10587
X4   -        0.02376  -        0.59731  0.07940
X5   0.12652  0.11217  0.15676  -        0.22236
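To make the weights table concrete, the Python sketch below applies the WAM, WPM and OWA weights to one input vector. Several things here are assumptions rather than facts from the R script: that WPM05 and WPM2 denote weighted power means with p = 0.5 and p = 2, that a "-" entry means a weight of 0, that the OWA weights apply to the inputs sorted in non-decreasing order, and the input vector itself, which is hypothetical. The Choquet integral is left out because its fuzzy measure is not given.

```python
def wam(x, w):
    """Weighted arithmetic mean."""
    return sum(wi * xi for wi, xi in zip(w, x))

def wpm(x, w, p):
    """Weighted power mean for p != 0; inputs must be non-negative."""
    return sum(wi * xi ** p for wi, xi in zip(w, x)) ** (1.0 / p)

def owa(x, w):
    """Ordered weighted average: weights go to sorted inputs, not to variables.
    Sorting in non-decreasing order is an assumed convention."""
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))

# Weights from the table, in the order (X1, X3, X4, X5); "-" is taken as 0.
w_wam = [0.62073, 0.25276, 0.0, 0.12652]
w_wpm05 = [0.59360, 0.27047, 0.02376, 0.11217]
w_wpm2 = [0.60432, 0.23892, 0.0, 0.15676]
w_owa = [0.09564, 0.30705, 0.59731, 0.0]

# Hypothetical normalised input (X1, X3, X4, X5), an assumption.
x = [0.2, 0.5, 0.7, 0.4]

print("WAM  :", round(wam(x, w_wam), 5))
print("WPM05:", round(wpm(x, w_wpm05, 0.5), 5))
print("WPM2 :", round(wpm(x, w_wpm2, 2), 5))
print("OWA  :", round(owa(x, w_owa), 5))
```

Because OWA weights attach to ordered positions, the X1 to X5 row labels in the OWA column are positions in the sorted input rather than variable importances.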
(iv) The comments
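a. From the table of errors, we observe that the five models perform very similarly. The best
model on the error measures is the Choquet model, which has the smallest root mean
square error (0.43853) and the smallest average L1 error (0.32146), although its Pearson and
Spearman correlations are the lowest of the five (Branke, 2016).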
b. From the table of weights, we see that the X4 variable receives little or no weight in the
WAM, WPM and Choquet models, so it contributes least to the fitted models.
c. The models do not consider the interaction between variables.
d. The better models favour higher inputs (Angilella, 2016).
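The four fit measures in the table of errors can be sketched in Python as follows. This is an illustration only: the function names are assumptions, the observed and predicted values are hypothetical, and the R script may compute the measures differently (for example in its handling of ties).

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def average_l1_error(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical observed and predicted values (assumption).
observed = [0.10, 0.25, 0.40, 0.55, 0.90]
predicted = [0.15, 0.20, 0.45, 0.50, 0.70]

print("RMSE    :", round(rmse(observed, predicted), 5))
print("av. L1  :", round(average_l1_error(observed, predicted), 5))
print("Pearson :", round(pearson(observed, predicted), 5))
print("Spearman:", round(spearman(observed, predicted), 5))
```

Lower RMSE and average L1 error indicate a closer fit, while higher correlations indicate that predictions track the observed values more closely, so the two groups of measures need not agree on the best model.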
4. Use your model for prediction
(ii) The best model is the Choquet model
(iii) Using the best model, the prediction of the energy use of appliances is as follows.
We apply the same transformation to the test data and obtain
           X1       X3       X4       X5
[1,] 4.165114 3.621671 4.314818 3.446808
log Y = 0.59236985 × 4.165114 + 0.10586753 × 3.621671 + 0.07940065 × 4.314818 + 0.22236196 × 3.446808
      = 3.959743
Thus, Y = exp(3.959743) = 52.44385, which is reasonable because the nearest neighbor of the
provided test data has a response value of 50.
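The arithmetic of this prediction can be checked with a short Python sketch. The weights and transformed inputs are copied from the text above; treating the logarithm as the natural logarithm is an assumption, made because exponentiating with base e reproduces the reported value.

```python
import math

# Choquet-column weights and transformed test inputs, as given in the text.
weights = {"X1": 0.59236985, "X3": 0.10586753, "X4": 0.07940065, "X5": 0.22236196}
x_test = {"X1": 4.165114, "X3": 3.621671, "X4": 4.314818, "X5": 3.446808}

# Weighted sum on the transformed scale.
log_y = sum(weights[name] * x_test[name] for name in weights)

# Back-transform, assuming a natural logarithm.
y = math.exp(log_y)

print("log Y =", round(log_y, 6))
print("Y     =", round(y, 5))
```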
References

Angilella, S., Corrente, S., Greco, S. and Słowiński, R., 2016, Robust Ordinal Regression and
Stochastic Multiobjective Acceptability Analysis in multiple criteria hierarchy process for the
Choquet integral preference model, Omega, Volume 63, Pages 154-169.
Branke, J., Corrente, S., Greco, S., Słowiński, R. and Zielniewicz, P., 2016, Using Choquet
integral as a preference model in interactive evolutionary multiobjective optimization, European
Journal of Operational Research, Volume 250, Number 3, Pages 884-901.
Cohen, J.E. and Xu, M., 2015, Random sampling of skewed distributions implies Taylor’s power
law of fluctuation scaling, Proceedings of the National Academy of Sciences, Volume 112,
Number 25, Pages 7749-7754.