Climate Data Analysis and Modeling


Added on  2020/03/04

AI Summary
The assignment presents a dataset of climate information, focusing on rainfall and temperature records spanning several decades. It requires students to analyze this data, identifying key trends such as increases or decreases in rainfall, fluctuations in minimum and maximum temperatures, and any significant shifts over time. The analysis should highlight the highest and lowest recorded values for each metric, as well as average values. The report also encourages exploration of potential future climate projections based on the observed historical patterns.

Running head: ASSIGNMENT 2
Assignment 2
Name of the Student
Name of the University
Author Note

Table of Contents
Task 1
Task 1.1
Task 1.2
Task 2
Task 2.1
Task 2.2
Task 3
References
Task 1
Task 1.1
Data Warehouse: A data warehouse is a kind of relational database designed for
querying and data analysis rather than for regular transactional processing. It stores
historical data collected from transactional systems and other sources. A data warehouse
helps the organization separate the analysis workload from the transactional workload of
its servers (Kimball 2013). Apart from its analysis capabilities, a data warehouse
environment also includes data extraction, transportation, transformation and loading
solutions. It also includes an online analytical processing (OLAP) engine, client data
analysis tools and applications, which are used to gather information and deliver it to
the users. A data warehouse is designed to help analyze the information collected from
the sources. To learn more about one of its departments, an organization can invest in a
data warehouse, which will analyze the information collected from that department
(Vaisman and Zimányi 2014). The ability to analyze information subject by subject
makes the warehouse subject oriented. A data warehouse is subject oriented, integrates
data from multiple sources, stores time-variant information and keeps that information
nonvolatile. To implement a data warehouse correctly, an organization must follow a
sound design method.
Data Lake: A data lake is a new generation of data storage that has been
developed to meet emerging trends in data analysis. It can be defined as a temporary
storage area for data collected from online resources for analysis by the
organization. The collected data is simply dropped into the data lake together with a
unique identifier, which can be used to identify the data it refers to. The identifier can be
compared to a metadata tag of the information collected (Miloslavskaya and Tolstoy
2016). When the information is analyzed, the identifiers are looked up using a query.
The relevant information is collected and the result is returned. The fetched data is
analyzed and a concise result is provided for decision making. The term data lake was
coined in the context of Hadoop-oriented object storage. Using a data lake can provide
useful information during data analysis or data mining in the organization (Fang 2015).
The concept of a data lake is a new trend in the digital world and is slowly being
accepted. Because a data lake is a large store of raw information, no schema has to be
defined in advance when designing the storage.
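The idea of dropping raw data into the lake with an identifier and a metadata tag can be sketched as follows. This is a minimal in-memory illustration, not how Hadoop or any real data lake product works; the class name `DataLake`, its methods and the sample payloads are all hypothetical.

```python
import json
import uuid


class DataLake:
    """Toy in-memory data lake: raw payloads are stored as-is and found via tags."""

    def __init__(self):
        self._objects = {}   # identifier -> raw payload (bytes, untouched)
        self._metadata = {}  # identifier -> set of metadata tags

    def ingest(self, payload, tags):
        """Drop raw data into the lake and return its unique identifier."""
        identifier = str(uuid.uuid4())
        self._objects[identifier] = payload
        self._metadata[identifier] = set(tags)
        return identifier

    def query(self, *tags):
        """Return the raw payloads whose metadata contains all requested tags."""
        wanted = set(tags)
        return [
            self._objects[identifier]
            for identifier, stored in self._metadata.items()
            if wanted <= stored
        ]


lake = DataLake()
lake.ingest(b'{"order": 17, "total": 42.5}', tags=["sales", "json"])
lake.ingest(b"2020-03-04,login,user7", tags=["weblog", "csv"])

# No schema was needed to store the data; structure is applied only on read.
sales = [json.loads(raw) for raw in lake.query("sales")]
```

The two payloads have different formats, yet both are stored without any table definition, which is the schema-free property described above.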
Data Mart: A data mart is a small version of a data warehouse that is used by a
particular group of workers to store the data they analyze. The term is often confused
with data warehouse, but the two are different (Ramos, Alturas and Moro 2017). They
may do the same work, but the working environment is different. A larger organization
always has the option of using a data warehouse. The data mart concept, however, is
newer and is slowly being accepted in the digital world (Golfarelli and Rizzi 2013).
Task 1.2
Data Warehouse: Data is stored in a data warehouse at a very granular level of
detail. For analysis, all relevant information is extracted, transformed and loaded (ETL).
This means that the information is first extracted from the sources and transformed into
a common format that the warehouse can read (Ross et al. 2014). The transformed
information is then loaded into the database for analysis. When a query is sent to the
data warehouse, it first locates the information in the warehouse and retrieves the data.
It then presents the information to the user in an integrated view. A warehouse provides
better query support than a traditional database. The warehouse has access to
enhanced spreadsheet functions, structured and faster query processing, data mining and
efficient viewing. The enhanced spreadsheet functions help the organization view the
analyzed data more clearly. An organization should have a data warehouse to perform
competitive and comparative analysis of historical data, to get real-time analysis of its
financial information, to simplify its data processing methods, to identify competitive
market trends and to reduce the cost of its operations (Kimball and Ross 2013). Most
organizations can benefit from the use of a data warehouse.
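The extract, transform and load sequence described for the data warehouse can be illustrated with a small sketch. It assumes two hypothetical source systems (`crm_rows` and `shop_rows`) with different formats and uses an in-memory SQLite database as a stand-in for the warehouse; all names and values are made up for illustration.

```python
import sqlite3

# Extract: hypothetical rows from two source systems with different formats.
crm_rows = [("2019-01-05", "Alice", "120.50"), ("2019-01-06", "Bob", "80.00")]
shop_rows = [{"day": "06/01/2019", "customer": "alice", "amount_cents": 9900}]


def transform_crm(row):
    """CRM rows already use ISO dates; normalize the name and parse the amount."""
    date, customer, amount = row
    return (date, customer.lower(), float(amount))


def transform_shop(row):
    """Shop rows use DD/MM/YYYY dates and amounts in cents."""
    day, month, year = row["day"].split("/")
    return (f"{year}-{month}-{day}", row["customer"].lower(), row["amount_cents"] / 100)


# Transform: bring both sources into one common format (ISO date, customer, amount).
staged = [transform_crm(r) for r in crm_rows] + [transform_shop(r) for r in shop_rows]

# Load: insert the unified rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_date TEXT, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

# Query: an integrated view over data that came from both sources.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

The final query sees one consistent table, which is the integrated view the warehouse presents to the user.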
Data Lake: A data lake helps an organization analyze data and information of
different variety and volume (O'Leary 2014). To implement a data lake successfully, an
organization has to use different tools to collect information from multiple data
sources. It also has to keep the data collection domain specific. Searching for
information across different departments would cause confusion, as the only identifier
of the data is a metadata tag. The management of the metadata should also be
automated. The data lake should be able to sort new incoming information into
categories, tag it and store it (Roski, Bo-Linn and Andrews 2014). By following these
steps, an organization will be able to implement a data lake. The schema that a
traditional database follows is absent in a data lake, which makes the implementation
easier. Experimental data analysis can also be done on the data stored in the lake.
Data Mart: A data mart is targeted at one department of an organization, so data
analysis is easier on the information stored in the data mart. A large organization can
save resources and time by analyzing its information department by department
(Rahman, Riyadi and Prasetyo 2015). The final analysis results can then be combined to
form more detailed information. The data mart uses the OLAP feature of the data
warehouse to analyze the information. Using a
data mart in an organization is helpful because the load of analyzing a data warehouse is
shared between the data marts. Each data mart exposes a different subset of the data
warehouse to which access can be authorized separately. It can be used to analyze the
return on investment of a department of an organization. The data mart provides savings
by reducing the time consumed by the analysis of the information (Zhu et al. 2015). If
the data marts are not used in the right manner, the whole data warehouse can collapse.
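A data mart as a departmental subset of the warehouse, and its use for a return-on-investment figure, might look like the sketch below. The rows, department names and the simple ROI formula are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical warehouse fact rows: (department, year, revenue, cost).
warehouse_rows = [
    ("sales", 2018, 500.0, 300.0),
    ("sales", 2019, 650.0, 320.0),
    ("support", 2018, 120.0, 150.0),
    ("support", 2019, 180.0, 160.0),
]


def build_data_marts(rows):
    """Split the warehouse rows into one subset (data mart) per department."""
    marts = defaultdict(list)
    for department, year, revenue, cost in rows:
        marts[department].append({"year": year, "revenue": revenue, "cost": cost})
    return dict(marts)


def return_on_investment(mart):
    """ROI of one department: (total revenue - total cost) / total cost."""
    revenue = sum(row["revenue"] for row in mart)
    cost = sum(row["cost"] for row in mart)
    return (revenue - cost) / cost


marts = build_data_marts(warehouse_rows)
roi_by_department = {
    name: round(return_on_investment(mart), 3) for name, mart in marts.items()
}
```

Each department works only on its own subset, so the analysis load is divided instead of every query scanning the full warehouse.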
Task 2
Task 2.1
The following images show the charts created using RapidMiner:
Figure 1: Initiating the connection between the database and the software
Figure 2: The chart tab of the database
The following scatter graphs have been plotted keeping the Quality of the white wine
against all the different variables of the wine.
Figure 3: Scatter graph of Quality VS fixed acidity

Figure 4: Scatter graph of Quality VS volatile acidity
Figure 5: Scatter graph of Quality VS citric acid
Figure 6: Scatter graph of Quality VS residual sugar
Figure 7: Scatter graph of Quality VS chlorides

Figure 8: Scatter graph of Quality VS free sulfur dioxide
Figure 9: Scatter graph of Quality VS total Sulfur dioxide
Figure 10: Scatter graph of Quality VS density
Figure 11: Scatter graph of Quality VS pH
Figure 12: Scatter graph of Quality VS Sulphates

Figure 13: Scatter graph of Quality VS alcohol
Figure 14: Setting up the process for the correlation table
Figure 15: The final correlation table from the data set
The report targets the top five variables that determine the quality of the white
wine. To find them, the quality of the wine is plotted against the other variables of the
wine samples. Quality is plotted on the x-axis of each graph and the other variable on
the y-axis. The points vary in color according to their position. The points farthest
away are colored red, followed by green and yellow, and the points closest to the origin
and the axes are colored blue. The best points in a graph are therefore the red spots.
From the graphs, alcohol, sulphates, total sulfur dioxide, fixed acidity and volatile
acidity seem to be the top five variables that can be used to determine the quality of
the white wine. The graphs with the most red spots are thus Quality VS alcohol, Quality
VS Sulphates, Quality VS total Sulfur dioxide, Quality VS fixed acidity and Quality VS
volatile acidity. This is the initial assumption from the graphs.
The correlation table made it clearer which top five variables can be chosen to
improve the quality of the wine. The higher the attribute weight, the better suited the
attribute is for determining the quality of the wine. Looking at the table, the top five
variables are alcohol, density, chlorides, volatile acidity and total sulfur dioxide.
These are the top five variables that can be used to determine the quality of the white
wine.
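Ranking variables by how strongly they correlate with quality can be sketched in a few lines. The sketch assumes the Pearson correlation coefficient and ranks by its absolute value, which may differ from how RapidMiner computes its attribute weights. The values in `toy` are made up and are not taken from the white wine data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target, top=5):
    """Return the `top` variables with the largest absolute correlation to `target`."""
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)[:top]


# Made-up toy values for illustration only.
toy = {
    "alcohol": [8.8, 9.5, 10.1, 11.0, 12.4],
    "density": [1.001, 0.998, 0.996, 0.994, 0.991],
    "pH": [3.0, 3.3, 3.1, 3.2, 3.0],
    "quality": [5, 5, 6, 6, 7],
}
ranking = rank_by_correlation(toy, "quality", top=2)
```

Ranking by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.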
Task 2.2
The linear regression table has been created in RapidMiner and the results are
shared below:
Figure 16: Setting up the software to find the linear regression
To build the final regression model, the following steps have to be followed:
1. Drag the data set from the repository and drop it into the process area.
2. Next, specify the attribute on which the linear regression has to be done. For this
model, quality has to be predicted. Therefore, the role needs to be set on the quality
attribute.
3. Search for Set Role in the Operators tab. Drag and drop the operator into the process
area.
4. Connect the out port of the white wine data set to the exa (example set) input port of
the Set Role operator.
5. To specify which variable is to be predicted, set the attribute name to quality and the
target role to label.
6. Next, search for Linear Regression in the Operators tab and drag the Linear
Regression operator onto the process area.
7. Connect the exa output port of the Set Role operator to the tra (training set) input
port of the Linear Regression operator.
8. Connect the mod (model) port of the Linear Regression operator to the res (result)
port of the process area.
9. Press F11 on the keyboard to run the process.
10. The resulting table shows the linear regression performed on the white wine data set.
Figure 17: The chart showing the summary table of the results from the linear
regression.
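The steps above build the model through the RapidMiner interface. What a linear regression computes can be sketched in plain Python using ordinary least squares via the normal equations. This is an illustrative sketch only and not RapidMiner's implementation, which may apply additional steps; the function name `fit_linear_regression` and the toy data are assumptions.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares: solve (X^T X) b = X^T y for b."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 models the intercept
    size = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    system = [
        [sum(r[i] * r[j] for r in design) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        pivot_value = system[col][col]
        system[col] = [value / pivot_value for value in system[col]]
        for r in range(size):
            if r != col:
                factor = system[r][col]
                system[r] = [a - factor * b for a, b in zip(system[r], system[col])]
    return [system[i][size] for i in range(size)]  # [intercept, coef_1, coef_2, ...]


# Toy data generated from y = 2 + 0.5*x1 - 1.0*x2 (made-up numbers, no noise).
features = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (4.0, 3.0), (5.0, 2.0)]
quality = [2 + 0.5 * x1 - 1.0 * x2 for x1, x2 in features]
intercept, coef_x1, coef_x2 = fit_linear_regression(features, quality)
```

Because the toy targets were generated without noise, the fit recovers the intercept and the two coefficients used to create them, up to floating-point rounding.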
The linear regression equation is:
Y = A + BX1 + CX2 + DX3 + EX4 + FX5 + GX6 + HX7 + IX8 + JX9 + KX10
Here in the equation,
Y is the dependent variable of the data set,
X1, X2, X3, X4, X5, X6, X7, X8, X9 and X10 are the independent variables of the data set,
A is the intercept,
B, C, D, E, F, G, H, I, J and K are the coefficients of the independent variables.
Where Y=quality,
X1=fixed acidity, X2=volatile acidity, X3=residual sugar, X4=chlorides, X5=free
sulfur dioxide, X6=total sulfur dioxide, X7=density, X8=pH, X9=sulphates and
X10=alcohol
A=149.900, B=0.066, C=-1.868, D=0.081, E=-0.234, F=0.004, G=-0.000,
H=-149.986, I=0.684, J=0.632 and K=0.194
Therefore, the linear regression equation becomes:
Quality = 149.900 + 0.066*fixed acidity – 1.868*volatile acidity + 0.081*residual sugar –
0.234*chlorides + 0.004*free sulfur dioxide – 0.000*total sulfur dioxide – 149.986*density +
0.684*pH + 0.632*sulphates + 0.194*alcohol
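The fitted equation can be evaluated directly for any wine sample. The sketch below uses the intercept and coefficients reported above; the values in `sample` are hypothetical and are not a row from the data set.

```python
# Intercept and coefficients as reported in the regression results above.
INTERCEPT = 149.900
COEFFICIENTS = {
    "fixed acidity": 0.066,
    "volatile acidity": -1.868,
    "residual sugar": 0.081,
    "chlorides": -0.234,
    "free sulfur dioxide": 0.004,
    "total sulfur dioxide": -0.000,
    "density": -149.986,
    "pH": 0.684,
    "sulphates": 0.632,
    "alcohol": 0.194,
}


def predict_quality(sample):
    """Evaluate Quality = A + sum(coefficient * value) for one wine sample."""
    return INTERCEPT + sum(
        coefficient * sample[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical sample values, chosen only to illustrate the calculation.
sample = {
    "fixed acidity": 7.0,
    "volatile acidity": 0.30,
    "residual sugar": 5.0,
    "chlorides": 0.05,
    "free sulfur dioxide": 30.0,
    "total sulfur dioxide": 120.0,
    "density": 0.995,
    "pH": 3.2,
    "sulphates": 0.5,
    "alcohol": 10.0,
}
predicted = predict_quality(sample)
```

The large intercept and the large negative density coefficient nearly cancel each other, since density values are close to 1, so the prediction ends up on the usual quality scale.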
Now if all the other variables are held at zero, so that only the intercept (A) and BX1
remain, the equation becomes:
Y = A + BX1
If A=0, then Y = BX1
From the above equation it is clear that Y is directly proportional to X1; in other
words, the quality of the wine is directly related to each of the variables of the wine.
When a variable with a positive coefficient increases, the quality increases by the
coefficient multiplied by that increase, and if the coefficient is negative, the quality
decreases in the same manner.
Substituting the values into Y = BX1 we get Quality = 0.066*fixed acidity, which shows that
for a one-unit increase in the fixed acidity, with the other variables unchanged, the
predicted quality of the wine increases by 0.066.
Following the same process, if we take Y = CX2 we get Quality = -1.868*volatile acidity,
which shows that for a one-unit increase in the volatile acidity of the wine the predicted
quality drops by 1.868.
For the given data set of white wine samples, the quality of the white wine increases
with fixed acidity, residual sugar, free sulfur dioxide, pH, sulphates and alcohol.
With an increase in volatile acidity, chlorides, total sulfur dioxide or density, the
quality of the white wine decreases.
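The grouping of variables by the sign of their coefficient, and the effect of a one-unit change, can be checked with a short sketch. It assumes the coefficients reported above; `effect_of_change` is a hypothetical helper name. The coefficient of total sulfur dioxide is reported as -0.000, so only its sign is known at three decimals, and the sketch reads the sign with `math.copysign` rather than comparing against zero.

```python
import math

# Coefficients as reported in the regression results above.
COEFFICIENTS = {
    "fixed acidity": 0.066,
    "volatile acidity": -1.868,
    "residual sugar": 0.081,
    "chlorides": -0.234,
    "free sulfur dioxide": 0.004,
    "total sulfur dioxide": -0.000,  # negative, but rounds to zero at three decimals
    "density": -149.986,
    "pH": 0.684,
    "sulphates": 0.632,
    "alcohol": 0.194,
}


def effect_of_change(variable, delta=1.0):
    """Change in predicted quality when one variable moves by delta, others fixed."""
    return COEFFICIENTS[variable] * delta


def sign_of(value):
    """Return 1.0 or -1.0; copysign keeps the sign even for -0.0."""
    return math.copysign(1.0, value)


raises_quality = sorted(n for n, c in COEFFICIENTS.items() if sign_of(c) > 0)
lowers_quality = sorted(n for n, c in COEFFICIENTS.items() if sign_of(c) < 0)
```

Calling `effect_of_change("volatile acidity")` gives -1.868, matching the drop in quality described above for a one-unit increase.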
Task 3
The data set provided records the snowfall at Whistler, BC, Canada. The file also
shows other relevant information on the weather conditions from 1972 to 2009. The table
provides daily information about the maximum, minimum and mean temperatures and the
total amounts of rainfall, snowfall and precipitation in the area. The following figure
shows the different graphs created in Tableau:

Figure 1: The Tableau charts created from the data file
The above figure shows the total snowfall, total rain, total precipitation, average of the
minimum temperatures, average of the maximum temperatures and the average temperatures
over the years. The average snowfall over the period has been over 600cm. 1999 had the
highest snowfall recorded in the area, at 1,564cm. A slight decrease in rainfall was
observed between 1986 and 2004, followed by a large decrease after 2005. The highest
recorded rainfall was 735.9cm. The period had an average rainfall of 700mm. The average
minimum temperature recorded for the area was below 0°C. The lowest average minimum
temperature recorded was around -9.529°C. The maximum temperature fluctuated but rose
above 0°C after 1980. The average maximum temperature declined after 1992 and rose
again after 2005. The lowest average temperature recorded for the period was 15.752°C.
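The yearly totals and averages shown in the Tableau charts come from aggregating the daily records by year. A minimal sketch of that aggregation is given below. The records in `daily`, their column order and their units are made up for illustration and are not taken from the Whistler file.

```python
from collections import defaultdict

# Made-up daily records: (date, max temp, min temp, rain, snow).
daily = [
    ("1972-01-01", -2.0, -9.0, 0.0, 12.0),
    ("1972-07-01", 24.0, 9.0, 3.5, 0.0),
    ("1973-01-01", 1.0, -6.0, 4.0, 8.0),
    ("1973-07-01", 26.0, 11.0, 0.0, 0.0),
]


def summarize_by_year(records):
    """Group daily records by year and compute totals and average temperatures."""
    grouped = defaultdict(list)
    for date, t_max, t_min, rain, snow in records:
        grouped[int(date[:4])].append((t_max, t_min, rain, snow))
    summary = {}
    for year, rows in sorted(grouped.items()):
        count = len(rows)
        summary[year] = {
            "avg_max_temp": sum(r[0] for r in rows) / count,
            "avg_min_temp": sum(r[1] for r in rows) / count,
            "total_rain": sum(r[2] for r in rows),
            "total_snow": sum(r[3] for r in rows),
        }
    return summary


yearly = summarize_by_year(daily)
# Year with the highest total snowfall, as reported for 1999 in the analysis above.
snowiest_year = max(yearly, key=lambda year: yearly[year]["total_snow"])
```

The same grouping extended over the full 1972 to 2009 range would yield the per-year series from which highs, lows and trends are read.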
References
Fang, H., 2015, June. Managing data lakes in big data era: What's a data lake and why has it
became popular in data management ecosystem. In Cyber Technology in Automation,
Control, and Intelligent Systems (CYBER), 2015 IEEE International Conference on (pp. 820-
824). IEEE.
Golfarelli, M. and Rizzi, S., 2013. Data warehouse testing. Developments in Data Extraction,
Management, and Analysis, pp.91-108.
Kimball, R. and Ross, M., 2013. The data warehouse toolkit: The definitive guide to
dimensional modeling. John Wiley & Sons.
Kimball, R., 2013. The Data Warehouse Toolkit: The Definitive Guide to Dimensional
Modeling E-Books.
Miloslavskaya, N. and Tolstoy, A., 2016, August. Application of Big Data, Fast Data, and
Data Lake Concepts to Information Security Issues. In Future Internet of Things and Cloud
Workshops (FiCloudW), IEEE International Conference on (pp. 148-153). IEEE.
O'Leary, D.E., 2014. Embedding AI and crowdsourcing in the big data lake. IEEE Intelligent
Systems, 29(5), pp.70-73.
Rahman, L., Riyadi, S. and Prasetyo, E., 2015. Development of Student Data Mart Using
Normalized Data Store Architecture. Advanced Science Letters, 21(10), pp.3225-3229.
Ramos, J., Alturas, B. and Moro, S., 2017, June. Business intelligence in a public institution
—Evaluation of a financial data mart. In Information Systems and Technologies (CISTI),
2017 12th Iberian Conference on (pp. 1-6). IEEE.
Roski, J., Bo-Linn, G.W. and Andrews, T.A., 2014. Creating value in health care through big
data: opportunities and policy implications. Health affairs, 33(7), pp.1115-1122.
Ross, T.R., Ng, D., Brown, J.S., Pardee, R., Hornbrook, M.C., Hart, G. and Steiner, J.F.,
2014. The HMO Research Network Virtual Data Warehouse: a public data model to support
collaboration. EGEMS, 2(1).
Vaisman, A. and Zimányi, E., 2014. Data Warehouse Systems: Design and Implementation.
Springer.
Zhu, Q., Liu, Y., Guo, S., Liu, S., Wang, G., Yan, S. and Tong, K., Linkedin Corporation,
2015. Data mart for machine learning. U.S. Patent Application 14/986,599.