Data Analysis and Visualization Project using RapidMiner and Tableau

Added on 2025/04/27

CIS 8008 Assignment-2
Student name: Anurag Gangasani
Student id: U1101669

Table of Contents
Task 1: Data Lakes and Data Warehousing
1.1 Provide a concise definition of data lake in the context of an organizational approach to
data management
1.2 Explain two advantages and two disadvantages of deploying data lake as part of an
organizational data management strategy
Task-2 Exploratory Data Analysis and Linear Regression Analysis
2.1 Conduct an exploratory data analysis (EDA) using RapidMiner Studio
2.2 Build a Linear Regression model for predicting salary of a person
Task-3 Tableau Desktop view of salary data set
3.1 Create Tableau Text Table or Graph View that displays salary values, average hours per
week, and other relevant data
3.2 Create Tableau Text Table or Graph View that displays salary values, education level, and
other relevant data
References

List of figures
Figure 1: Data Lake
Figure 2: Imported data
Figure 3: Selected attributes
Figure 4: Dataset of selected attribute
Figure 5: Statistics of selected attribute
Figure 6: Result 1- Sib and married chart
Figure 7: Result 2- frequency v/s value of sib
Figure 8: Result 3-education, married, and Wexperience
Figure 9: Process design
Figure 10: Selected attributes
Figure 11: Data splitting
Figure 12: Data obtained from linear regression
Figure 13: Linear regression description
Figure 14: Age, salary, and hours chart
Figure 15: Salary and education chart

Task 1: Data Lakes and Data Warehousing
1.1 Provide a concise definition of data lake in the context of an organizational
approach to data management.
A data lake is a centralized repository that allows an organization to store all of its structured
and unstructured data at any scale. Data can be stored as-is, without first being converted into a
structured form, and different types of analytics can then be run on it, from dashboards and
visualizations to big data processing, real-time analytics, and machine learning, to support better
decisions.
Figure 1: Data Lake
(Source: AWS, 2019)
Organizations that successfully generate business value from their data will outperform their
peers. According to an Aberdeen survey, organizations that implemented a data lake outperformed
similar companies by 9% in organic revenue growth. These organizations were able to run new
types of analytics, such as machine learning, over sources such as log files and data from
internet-connected devices stored in the data lake. A data lake returns query results quickly on
low-cost storage and holds both raw and structured data. In an organizational context, a data lake
allows any type of data collected from multiple sources to be imported and moved into the lake in
its original form. This process makes it possible to scale to data of any size while saving the
time otherwise spent defining schemas, data structures, and transformations (AWS, 2019).
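The idea of storing data in its original form and applying structure only when it is read can be
sketched in a few lines of Python. This is a minimal illustration, not how AWS or any particular
product implements a data lake; the in-memory `lake` dictionary, the object keys, the field name
`device_id`, and the sample payloads are all hypothetical.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads are stored untouched, keyed by object name.
lake = {}

def ingest(key, raw_bytes):
    """Store the payload exactly as received; no schema is enforced on write."""
    lake[key] = raw_bytes

def read_device_ids(key):
    """Apply structure at read time, choosing a parser from the file extension."""
    text = lake[key].decode("utf-8")
    if key.endswith(".csv"):
        return [row["device_id"] for row in csv.DictReader(io.StringIO(text))]
    if key.endswith(".jsonl"):
        return [json.loads(line)["device_id"]
                for line in text.splitlines() if line.strip()]
    raise ValueError(f"no reader registered for {key}")

# Made-up sample payloads in two different formats.
ingest("sensors/2019-01.csv", b"device_id,temp\nA1,20.5\nB2,21.0\n")
ingest("logs/2019-01.jsonl",
       b'{"device_id": "A1", "event": "start"}\n'
       b'{"device_id": "C3", "event": "stop"}\n')

# One query spans both formats because structure is imposed only on read.
all_ids = sorted(set(read_device_ids("sensors/2019-01.csv"))
                 | set(read_device_ids("logs/2019-01.jsonl")))
print(all_ids)
```

Because nothing is parsed at ingestion, a new source format only needs a new reader, which is the
time saving on schemas and transformations described above.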
1.2 Explain two advantages and two disadvantages of deploying data lake as part of
an organizational data management strategy
A data lake enables an organization to store any kind of data, whether raw or structured, and to
run large-scale analytics on it. Because the variety and volume of stored data are growing
rapidly, the data lake has become an emerging data management strategy. With a data lake,
organizations can store relational data, such as data from operational databases, alongside
non-relational data, such as data from IoT devices, mobile applications, and social media. A data
lake also makes it possible to understand what data is being stored through crawling, cataloging,
and indexing, and it ensures that the stored data is protected and secured. A data lake offers
many advantages to an organization; the two main ones are:
• Enhanced customer interaction: A data lake can combine customer data from a CRM platform with
social media analytics and with marketing data such as purchasing history and incident tickets.
This enables the organization to understand its most profitable customers, the causes of
customer churn, and the promotions or rewards that will increase loyalty (Mark, 2017).
• Increased operational efficiency: The Internet of Things (IoT) introduces more ways to gather
data on processes such as manufacturing, with real-time data coming from internet-connected
devices. A data lake makes it easy to store machine-generated IoT data and run analytics on it,
which helps to discover ways to reduce operational costs and thereby increase efficiency and
quality.
Alongside these advantages, a data lake also has disadvantages. Two of them are:
• Lack of oversight: In a data lake architecture, raw data is stored with no oversight of its
contents. Without a defined catalog and a mechanism for refining and filtering the data, it is
difficult to keep the data secure.
• Data swamp: Without a filtering mechanism, a data lake can turn into a data swamp, in which it
becomes difficult to find specific data among the large amount stored. The stored data is also
sometimes inconsistent and lacks proper access controls (AWS, 2019).
Task-2 Exploratory Data Analysis and Linear Regression Analysis
2.1 Conduct an exploratory data analysis (EDA) using RapidMiner Studio.
Exploratory Data Analysis (EDA) is an approach to analyzing large amounts of data that employs
several techniques to extract important variables, test underlying assumptions, uncover
underlying structure, determine optimal factor settings, maximize insight into the data set, and
detect anomalies. The focus of EDA is on what the data itself reveals when it is analyzed, and it
relies mainly on plots of the raw data, such as histograms, and plots of simple statistics, such
as standard deviation plots (Shelby, 2018).
To perform EDA on the Salary.csv dataset, the RapidMiner tool is used. First, the data stored in
the dataset is imported using the import file option.
Figure 2: Imported data
The figure above shows the Salary.csv data being imported.

The next step is to select the attributes on which the analysis is to be performed. Two operators
are used for this: Retrieve Salary.csv and Select Attributes. In the Select Attributes operator,
the following attributes are selected:
1. Hours
2. Education
3. Wexperience
4. Married
5. South
6. Sibs
Figure 3: Selected attributes
The two operators are connected to each other, and running the process yields the following dataset:
Figure 4: Dataset of a selected attribute
Summary statistics for the selected attributes are also obtained from the analysis.
Figure 5: Statistics of a selected attribute
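For readers without RapidMiner, the attribute selection and statistics steps can be approximated
with the Python standard library. This is only a sketch: the rows in `SAMPLE_CSV` are made up to
stand in for Salary.csv, the `Salary` column name is an assumption, and the choice of min, max,
mean, and standard deviation is an assumption about what the statistics view reports.

```python
import csv
import io
import statistics

# Made-up rows standing in for Salary.csv; the real file's values are not reproduced here.
SAMPLE_CSV = """Salary,Hours,Education,Wexperience,Married,South,Sibs
30000,40,12,10,1,0,2
45000,45,16,8,1,1,1
28000,38,12,5,0,0,3
52000,50,18,12,1,0,0
"""

# The six attributes listed in the report.
SELECTED = ["Hours", "Education", "Wexperience", "Married", "South", "Sibs"]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Rough equivalent of the attribute selection step: keep only the chosen columns.
selected_rows = [{name: float(row[name]) for name in SELECTED} for row in rows]

# Rough equivalent of the statistics view: one summary per attribute.
summary = {}
for name in SELECTED:
    values = [row[name] for row in selected_rows]
    summary[name] = {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

for name, stats in summary.items():
    print(name, stats)
```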
Comparisons are then performed between these attributes, and the following results are
obtained:
1. Siblings-married chart
Figure 6: Result 1- Sib and married chart
The figure above shows the chart obtained from the analysis of the Sibs and Married attributes.
2. Frequency and value of sib
Figure 7: Result 2- frequency v/s value of sib
The figure above shows the chart of frequency plotted against the value of the Sibs attribute.
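A frequency chart of this kind counts how often each distinct value occurs. The sketch below shows
the counting step with a text histogram; the `sibs` list is made up, so its counts do not
correspond to Figure 7.

```python
from collections import Counter

# Made-up Sibs values; the counts in Figure 7 come from the real dataset, not from these.
sibs = [2, 1, 3, 0, 2, 2, 1, 4, 0, 2]

# Count how many times each distinct value appears.
frequency = Counter(sibs)

# Print a text histogram, one row per distinct value in ascending order.
for value in sorted(frequency):
    print(f"{value}: {'#' * frequency[value]}")
```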
3. Education-married-Wexperience
Figure 8: Result 3-education, married, and Wexperience
The figure above shows the chart obtained from the analysis of the Education, Married, and
Wexperience attributes.
2.2 Build a Linear Regression model for predicting the salary of a person
To analyze the relationship among the distinct attributes, a linear regression model is used.
Linear regression is mainly used in predictive analysis to determine the relationship between a
dependent variable and one or more independent variables (Anandarajan, Hill & Nolan,
2019).
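In general form, and using symbols that do not appear in the report itself, a linear regression
model with p independent variables can be written as:

```latex
\[
  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
\]
```

Here y is the dependent variable (salary in this task), x_1 to x_p are the independent variables,
\beta_0 is the intercept, \beta_1 to \beta_p are the coefficients, and \varepsilon is the error
term. Ordinary least squares chooses the coefficients that minimize the sum of squared differences
between the observed and predicted values of y.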
To predict a person's income, a linear regression model is built using the RapidMiner tool. To
obtain the results, the file Salary.csv is imported and the analysis is performed using several
operators.
Figure 9: Process design
The figure above shows the process design, in which several operators are used. The first
operator, Retrieve Salary, retrieves all the data from the Salary.csv dataset. The Select
Attributes operator is then used to select the following attributes.
Figure 10: Selected attributes
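The overall shape of such a process (load data, split it, fit a linear model, evaluate it) can be
sketched in plain Python. This is not the RapidMiner process from the report: the data is
generated from a known line rather than read from Salary.csv, only one predictor is used for
brevity, and the 70/30 split ratio and fixed seed are assumptions made for illustration.

```python
import random

# Made-up (hours, salary) pairs generated from a known line; not values from Salary.csv.
data = [(h, 1000.0 * h + 5000.0) for h in range(30, 50)]

# Shuffle with a fixed seed, then hold out 30% of the rows for evaluation.
# The 70/30 ratio is an assumption for illustration.
rng = random.Random(42)
rng.shuffle(data)
cut = int(len(data) * 0.7)
train, holdout = data[:cut], data[cut:]

def fit_line(pairs):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# Evaluate on the held-out rows with root mean squared error.
rmse = (sum((y - (slope * x + intercept)) ** 2 for x, y in holdout)
        / len(holdout)) ** 0.5
print(slope, intercept, rmse)
```

Because the sample data lies exactly on a line, the fitted slope and intercept recover the
generating values and the error is close to zero; real salary data would leave a nonzero error.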