Semester Project: Business Intelligence Report Analysis
Summary
This report details a Business Intelligence project undertaken using RapidMiner to analyze Australian weather data. The project is divided into three tasks. Task 1 focuses on exploratory data analysis, decision tree modeling, and logistic regression modeling to predict rainfall. Task 2 involves researching and designing a high-level data warehouse architecture, including its main components and ethical considerations. Task 3 entails creating a crime dataset for a scenario dashboard. The report includes detailed explanations of each step, including the use of RapidMiner for data preparation, model building, and validation, as well as a discussion of data warehouse design and security concerns. The project aims to apply business intelligence techniques to solve real-world problems, demonstrating the practical application of data mining and data warehousing principles.

University
Semester
Business Intelligence
Student ID
Student Name
Submission Date

Table of Contents
1. Project Description
2. Task 1 - Rapid Miner
2.1 Exploratory Data Analysis on Weather AUS Data
2.2 Decision Tree Model
2.3 Logistic Regression Model
2.4 Final Decision Tree and Final Logistic Regression Models' Validation
3. Task 2 - Research the Relevant Literature
3.1 Architecture Design of a High Level Data Warehouse
3.2 Proposed High Level Data Warehouse Architecture Design's Main Components
3.3 Security, Privacy and the Ethical Concerns
4. Task 3 – Scenario Dashboard
4.1 Crime Category
4.2 Frequency of Occurrence
4.3 Frequency of Crimes
4.4 Geographical Presentation of Each Police Department Area
References


1. Project Description
This project uses the RapidMiner tool for data preparation, data import, data mining, information gathering, evaluation and analysis of a specific data platform. The data set used is the Australian weather data. The aim is to identify real-world issues and apply business intelligence practically and creatively to solve complex organisational problems, showing how business intelligence can be implemented in an organisation's systems and business processes. The project is divided into three tasks:
Task 1 – RapidMiner is used to forecast the next day's weather from today's weather conditions, by analysing and evaluating climate data sets collected by Australian researchers. The following phases of the CRISP-DM data mining process are applied: business understanding, data understanding, data preparation, modelling and evaluation.
Task 2 – A large amount of information collected from researchers is studied and evaluated, and the tools and methods mentioned above are used to derive the required results and outcomes for the data warehouse architecture and its related material.
Task 3 – A crime dataset will be created for the dashboard.
These topics and tasks are evaluated, studied and discussed in detail below.
2. Task 1 - Rapid Miner
For the first task, the data understanding, data preparation, modelling and evaluation phases of the CRISP-DM data mining process are applied to analyse the possibility of rainfall tomorrow, predicting the next day's rainfall from the data collected on today's climatic conditions.

For this, the following steps will be carried out (Sherif, 2016):
Exploratory data analysis on the Australian weather data
Preparation of a decision tree model
Preparation of a logistic regression model
Validation of the final decision tree model and checking of its outcome
Validation of the final logistic regression model and checking of its outcome
2.1 Exploratory Data Analysis on Weather AUS Data
Using the RapidMiner tool, we evaluate the information in the Australian weather data. Every attribute in the data set that affects the outcome of the study is examined, along with the inter-relationships between these attributes. The examination also covers characteristics of the selected variables such as missing values, faulty data inputs, minimum and maximum values, the most frequent values, standard deviation and more. We then consider how these characteristics play a role in the analysis results.
The exploratory analysis of the Australian weather data is carried out in RapidMiner in the following steps. First, open RapidMiner and click on New Process, as shown in the image below.

The image below shows the data added to the newly created process. The next image shows the specified data format.

A new process has thus been created by successfully importing all the weather data into the system, as shown in the image below (Angra & Ahuja, 2016). The descriptive statistics produced by the exploratory data analysis of the Australian weather data are shown in the next image.


The image below shows the scatter plots for the weather attributes produced during the exploratory data analysis of the gathered data sets, used to predict tomorrow's rainfall from today's climatic and weather conditions.

The correlation between the attributes is determined as part of the exploratory data analysis and is displayed in the image below (Czernicki, 2010).

The image below shows the scatter plots for the correlations. The correlation values for each attribute have now been successfully identified.
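The exploration above is carried out in the RapidMiner GUI. As a rough, non-authoritative cross-check, the following minimal Python sketch produces the same kind of output (descriptive statistics, missing-value counts, attribute correlations and a scatter plot) with pandas and matplotlib; the file name weatherAUS.csv and the column names Humidity3pm, Temp3pm and RainTomorrow are assumptions based on the public Australian weather data set, not necessarily the exact file used in the report.

# Hedged sketch only: assumes the public weatherAUS.csv file with columns
# such as Humidity3pm, Temp3pm and RainTomorrow.
import pandas as pd
import matplotlib.pyplot as plt

weather = pd.read_csv("weatherAUS.csv")

# Descriptive statistics (count, mean, standard deviation, min, max) and
# the number of missing values per attribute.
print(weather.describe(include="all"))
print(weather.isna().sum())

# Correlation matrix over the numeric attributes.
print(weather.select_dtypes(include="number").corr())

# Scatter plot of two attributes, coloured by tomorrow's rain label.
colours = weather["RainTomorrow"].map({"No": "tab:blue", "Yes": "tab:orange"})
colours = colours.fillna("tab:gray").tolist()
weather.plot.scatter(x="Humidity3pm", y="Temp3pm", c=colours, s=5)
plt.show()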
2.2 Decision Tree Model
We now prepare a decision tree, "a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility", using the RapidMiner tool. The decision tree is applied to the given data set, together with the appropriate data mining operators, to obtain the desired result: in our case, the prediction of tomorrow's rainfall from today's weather conditions.
The decision tree's output is shown in the image below. The tree is described in text form as follows:
Humidity3pm =?
| RainToday = NA
| | MinTemp =?
| | | Pressure9am =? NA {No=16, Yes=24, NA=1004}
| | | Pressure9am > 1008.700: No {No=11, Yes=2, NA=4}
| | | Pressure9am ≤ 1008.700: Yes {No=0, Yes=6, NA=2}
| | MinTemp > 6.800
| | | MaxTemp =? NA {No=0, Yes=0, NA=6}
| | | MaxTemp > 18.600: NA {No=15, Yes=7, NA=33}
| | | MaxTemp ≤ 18.600: Yes {No=0, Yes=3, NA=1}
| | MinTemp ≤ 6.800: No {No=9, Yes=7, NA=1}
| RainToday = No
| | MaxTemp =? No {No=108, Yes=8, NA=40}
| | MaxTemp > -0.150
| | | MaxTemp > 7.600: No {No=1739, Yes=346, NA=33}
| | | MaxTemp ≤ 7.600
| | | | MinTemp > -4.200
| | | | | MinTemp > -3.600: Yes {No=12, Yes=20, NA=4}
| | | | | MinTemp ≤ -3.600: NA {No=0, Yes=0, NA=2}
| | | | MinTemp ≤ -4.200: No {No=3, Yes=0, NA=0}
| | MaxTemp ≤ -0.150: Yes {No=1, Yes=2, NA=2}
| RainToday = Yes: Yes {No=368, Yes=410, NA=42}
Humidity3pm > 83.500
| Temp3pm > 36.400: No {No=8, Yes=0, NA=0}
| Temp3pm ≤ 36.400: Yes {No=1711, Yes=7555, NA=223}
Humidity3pm ≤ 83.500: No {No=100492, Yes=21893, NA=2134}
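The tree above is the output of RapidMiner's decision tree operator. As an illustrative sketch of the same modelling step, the following Python code trains a comparable decision tree with scikit-learn and prints it in a similar text form; the file name, the small set of predictor columns, the simple median imputation and the depth limit are assumptions made for this example rather than the report's actual process.

# Hedged sketch: a depth-limited decision tree for RainTomorrow, analogous
# to the RapidMiner decision tree above. Column names are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

weather = pd.read_csv("weatherAUS.csv").dropna(subset=["RainTomorrow"])

features = ["MinTemp", "MaxTemp", "Humidity3pm", "Temp3pm", "Pressure9am"]
X = weather[features].fillna(weather[features].median())  # simple imputation
y = weather["RainTomorrow"]

# Limit the depth so the printed tree stays readable, as in the listing above.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X, y)

# Text description of the learned tree, similar to the listing above.
print(export_text(tree, feature_names=features))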
2.3 Logistic Regression Model
Using the RapidMiner tool with the collected information on present climatic conditions, together with the prepared data set values from the mining operators used in the project, a logistic regression model is built to predict tomorrow's rainfall (K & Wadhawa, 2016).

The following operators are used to build the logistic regression model:
Read CSV
Numerical to Polynomial
Polynomial by Binomial Classification
Binomial to Numerical
Logistic Regression
To predict tomorrow's rainfall from today's weather conditions and the mined data sets, the logistic regression model is used: it estimates the odds of each outcome from the input parameters, and this value is used for the forecast.
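The operator chain above implements the model in RapidMiner. For readers who prefer code, the sketch below builds a comparable logistic regression in Python, converting the nominal attribute to numerical indicators before fitting, which loosely mirrors the nominal-to-numerical steps in the operator list; all column names and preprocessing choices are assumptions for illustration only.

# Hedged sketch: logistic regression for RainTomorrow with one-hot encoding
# of a nominal attribute, loosely mirroring the RapidMiner operator chain.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

weather = pd.read_csv("weatherAUS.csv").dropna(subset=["RainTomorrow"])
weather["RainToday"] = weather["RainToday"].fillna("NA")

numeric = ["MinTemp", "MaxTemp", "Humidity3pm", "Temp3pm", "Pressure9am"]
nominal = ["RainToday"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal),
])

model = Pipeline([
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
model.fit(weather[numeric + nominal], weather["RainTomorrow"])

# The fitted model estimates the probability (and hence the odds) of rain
# tomorrow from today's conditions; these values drive the forecast.
print(model.predict_proba(weather[numeric + nominal].head()))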
2.4 Final Decision Tree and Final Logistic Regression Models' Validation
The model application and performance operators are used here, and the decision tree and logistic regression models are validated using cross validation. The outcome is shown in the image below (Maheshwari, 2016).

For a Decision Tree Model
Performance Vector

PerformanceVector:
Accuracy: 80.55%
ConfusionMatrix:
True: No Yes NA
No: 102370 22256 2212
Yes: 2092 7996 274
NA: 31 31 1045
Kappa: 0.331
ConfusionMatrix:
True: No Yes NA
No: 102370 22256 2212
Yes: 2092 7996 274
NA: 31 31 1045
For the logistic regression model, the performance vector is obtained in the same way.
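The figures above come from RapidMiner's cross-validation and performance operators. As a non-authoritative illustration of how such a performance vector is obtained, the sketch below computes cross-validated accuracy, kappa and a confusion matrix in Python for the decision tree of the earlier sketch; swapping in the logistic regression pipeline would give the corresponding figures for that model. The model, columns and fold count are assumptions, not the report's actual RapidMiner process.

# Hedged sketch: cross-validated accuracy, Cohen's kappa and confusion
# matrix, analogous to the RapidMiner cross-validation / performance output.
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

weather = pd.read_csv("weatherAUS.csv").dropna(subset=["RainTomorrow"])
features = ["MinTemp", "MaxTemp", "Humidity3pm", "Temp3pm", "Pressure9am"]
X = weather[features].fillna(weather[features].median())
y = weather["RainTomorrow"]

model = DecisionTreeClassifier(max_depth=4, random_state=42)

# 10-fold cross-validated predictions for every example in the data set.
predictions = cross_val_predict(model, X, y, cv=10)

print("accuracy:", accuracy_score(y, predictions))
print("kappa:   ", cohen_kappa_score(y, predictions))
print("confusion matrix (rows = true class, columns = predicted class):")
print(confusion_matrix(y, predictions))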

3. Task 2 - Research the Relevant Literature
3.1 Architecture Design of a High Level Data Warehouse
The image below displays the architecture design of a high level data warehouse.

3.2 Proposed High Level Data Warehouse Architecture Design's Main Components
Components of the Data Warehouse
Around the central information repository, which runs on an RDBMS server, a number of supporting elements are arranged to form a complete environment that controls, manages and operates the data warehouse. The main components of the data warehouse are (Mercier, 2017):
Database of the Data Warehouse
The central database, on which RDBMS technology is implemented, is the foundation of the data warehousing environment. A traditional RDBMS, however, is of constrained and limited use for data warehousing, because it is optimised for transactional database processing; warehouse workloads such as aggregates, multi-table joins and ad-hoc queries are resource intensive and affect the overall performance of the system.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
Copying data from one or more sources into a destination system can result in differences between the data on the two sides; the loaded data should remain unchanged with respect to its source throughout.
Migration and transformation tools are employed for this purpose: an ETL (Extract, Transform and Load) process is deployed to transform the data into a unified format, applying and describing all the necessary modifications. A minimal sketch of such a pipeline is given below.
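As a concrete illustration of such a pipeline, the sketch below extracts records from a source file, transforms them into a unified format and loads them into a warehouse table. The file name source_sales.csv, the column names and the SQLite target are hypothetical stand-ins for whatever sources and warehouse platform an organisation actually uses.

# Hedged ETL sketch: extract from a hypothetical CSV source, transform the
# records into a unified format, and load them into a stand-in SQLite table.
import sqlite3
import pandas as pd

# Extract: read the raw source data (hypothetical file).
source = pd.read_csv("source_sales.csv")

# Transform: unify column names, types and formats, and drop unusable rows.
source = source.rename(columns={"SaleDate": "sale_date", "Amt": "amount"})
source["sale_date"] = pd.to_datetime(source["sale_date"], errors="coerce")
source["amount"] = pd.to_numeric(source["amount"], errors="coerce")
source = source.dropna(subset=["sale_date", "amount"])

# Load: append the cleaned records to the warehouse fact table.
warehouse = sqlite3.connect("warehouse.db")
source.to_sql("fact_sales", warehouse, if_exists="append", index=False)
warehouse.close()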
Metadata
Metadata, "data that provides information about other data", helps in creating, conserving and supervising the data in the data warehouse. Many distinct types of metadata exist, such as structural, reference, administrative, statistical and descriptive metadata. Metadata plays a crucial role within the data warehouse architecture: by specifying the usage, source, parameters, variables and features of the data, it shapes how the data is managed, processed and used in the data warehouse and by other mining tools.
Query Tools
One of the main objectives of data warehousing is to support important strategic decisions, and query tools serve this purpose. A query is a request for data or information from a database table or from the data warehouse; query tools act as a mediator between the users and the administrators. A minimal example of such a query is sketched after the list below.
The four varieties of query tools are:
1. Application Development tools
2. Query and reporting tools
3. OLAP tools
4. Data mining tools
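As a minimal example of the kind of request a query and reporting tool issues against the warehouse, the sketch below runs an aggregate query over the hypothetical fact_sales table loaded in the ETL sketch above; the table and column names are assumptions made for illustration, not part of the report's actual system.

# Hedged sketch of a reporting query against the stand-in SQLite warehouse.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")

# Total sales amount per month: a typical query a reporting tool would run.
query = """
    SELECT strftime('%Y-%m', sale_date) AS month,
           SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY month
    ORDER BY month
"""
for month, total in warehouse.execute(query):
    print(month, total)
warehouse.close()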
3.3 Security, Privacy and the Ethical Concerns
For every system administrator, keeping the system safe and secure is a demanding task. With the amount of data passing within the system and also outside it, a strongly secured environment becomes more and more important. The security arrangements should not put the vast amount of vital information in the system at risk, yet at the same time should not complicate matters by implementing overly complex security tools. All of this is addressed and analysed in this part of the project.
The following questions related to the safety and security of the system should be asked:

What is the privacy and security status of the system, and how is the private information of individuals collected and used within the system?
Who has the authority to access and use such information, and who holds the controlling privileges for it?
What is the information policy of the system regarding how much information should be shared and how much can be used?
Who has the authority to stop this information from being shared or used in the system?
Overall, with the internet easily available to everyone, the privacy of private data, user information and personal data becomes a very important issue. To address this and the questions above, a proper, detailed study is carried out of how technology and the available tools can secure data and valuable assets, covering the following:
Use of methodology, technology and guidance
Disclosure and unauthorized access
Modification/disruption
Destruction
Recording
Inspection
Privacy vs. security
The main priority and focus are given to the data privacy policy, its use and misuse, the governance of individual data and the use of security-related tools, especially the gathering, misuse and sharing of an individual's data. The most worrying problem is the security of the system and how to safeguard this vital data, especially from outside threats and from data being stolen and used for malicious activities.

Big data privacy in data generation phase
The way data is generated depends on the type of data and how it reaches the system, and on this basis two varieties are distinguished: active data generation and passive data generation. Passive data generation occurs when the data comes from the data owner's online activities, such as browsing, and the user is unaware that this browsing history is being stored by a third party. Active data generation occurs when the user gives the information voluntarily, for example by filling in a form.
Big data privacy in data storage phase
For saving vast amounts of information there are advances such as cloud storage technology, and a compromise of the data storage system is therefore a very alarming situation, since the stored information could be misused by anyone. Gathering datasets from many data centres, as in a distributed environment, is another situation where the risk to safety and security is high and the system can be vulnerable.
Conventional approaches to securing stored data fall into four main divisions: database-level security schemes, file-level security schemes, application-level encryption schemes and media-level security schemes.
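To make the application-level scheme concrete, the sketch below encrypts a sensitive field before it is written to storage and decrypts it when it is read back, using the symmetric Fernet scheme from the Python cryptography package; the record, field name and key handling are simplified assumptions (in practice the key would live in a key-management service, never next to the data).

# Hedged sketch of application-level encryption of a sensitive field before
# storage. Key handling is deliberately simplified for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice: fetched from a key manager
cipher = Fernet(key)

record = {"user_id": 42, "salary": "85000"}     # hypothetical record

# Encrypt the sensitive attribute before the record is stored.
record["salary"] = cipher.encrypt(record["salary"].encode()).decode()
print("stored form:", record)

# Decrypt the attribute when an authorised application reads it back.
salary = cipher.decrypt(record["salary"].encode()).decode()
print("recovered salary:", salary)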
Big data privacy preserving in data processing
Big data processing paradigms can be categorised into batch, stream, graph and machine learning processing. For the overall protection of the data, privacy concerns in the processing stage can be divided into two phases. More emphasis and importance must be given to the first phase, data collection, since it may involve data from users and individuals, and safeguarding that information from unsolicited disclosure becomes vital. In the second phase, the objective is to extract meaningful data and results from the collected data without violating the privacy of the user.
Having discussed the traditional ways in which vast amounts of data are secured and used, we now look into newer methods that provide adequate privacy in the relevant areas.

De-identification
Before the data mining procedure starts, the data should be generalised and sanitised using suppression methods, so that the user's privacy and data remain safe and secure.
De-identification is an established and important technique for privacy protection in big data analytics and for privacy-preserving data mining of individuals' data. To reduce the risk of re-identification and to improve the traditional privacy-preserving data mining methods, the concepts of k-anonymity, l-diversity and t-closeness are applied. Past records and observations show, however, that in big data re-identification remains a threat, because additional external data can assist in re-identifying individuals. De-identification alone is therefore not sufficient to secure the individual's private data across the entire system.
Privacy preservation thus faces a major challenge: either the flexibility and effectiveness requirements are not met, or the risk of re-identification is not addressed properly.
De-identification can still be highly practicable for protecting individuals in big data analytics if effective privacy-preserving tools are developed that help mitigate re-identification risks. K-anonymity, l-diversity and t-closeness are the three main de-identification methods for privacy preservation.
These approaches use some common privacy terms:
Identifier attributes contain values such as full name, driver licence number or social security number that are unique to an individual and vary from user to user.
Quasi-identifier attributes refer to a set of information such as age, gender, zip code or birthdate; combined with external data, these can be used to re-identify individuals.
Sensitive attributes refer to an individual's personal and exclusive data, for instance salary or sickness.
Insensitive attributes refer to information that is general and innocuous.
Equivalence classes refer to sets of records that share identical values on the quasi-identifiers.

K-anonymity
A release of data is said to have the k-anonymity property if each individual in the release cannot be distinguished from at least k-1 other individuals whose details also appear in the release. In the context of the k-anonymization problem, the database is a table of n rows and m columns: each row denotes a record relating to a specific member of the population, and the entries of different rows need not be distinct. The columns contain the values of the attributes associated with the members of the population.
L-diversity
L-diversity is a form of group-based anonymization developed and used to protect the privacy of data sets by reducing the granularity of the data representation. This reduction is a trade-off: some privacy is gained at the cost of losing data that is valuable for information management or for data mining algorithms. Using generalisation and suppression, the representation is reduced so that every record maps onto at least k distinct records in the data, which corresponds to executing the k-anonymity model; the l-diversity model (in its distinct, entropy and recursive variants) builds on this.
The l-diversity model addresses some weaknesses of the k-anonymity model, in particular the case where the sensitive values within a group are homogeneous: protecting identities at the level of k individuals is not the same as protecting the corresponding sensitive values that were generalised or suppressed. The l-diversity model therefore promotes intra-group diversity of the sensitive values within the anonymization mechanism. Overall security is improved, although issues remain, for example the fictitious data that has to be inserted if we wish to make data l-diverse on a sensitive attribute that has few distinct values. Because the l-diversity method is exposed to skewness and similarity attacks, it cannot fully prevent attribute disclosure.
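The sketch below makes the two definitions operational: for a small, invented table it forms the equivalence classes over the assumed quasi-identifiers and reports the resulting k (the smallest class size) and l (the smallest number of distinct sensitive values in any class). The column names and records are purely illustrative.

# Hedged sketch: measuring k-anonymity and (distinct) l-diversity of a
# hypothetical released table, using the terms defined above.
import pandas as pd

released = pd.DataFrame({
    "age_band": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip3":     ["123**", "123**", "123**", "456**", "456**"],
    "disease":  ["flu", "flu", "cold", "asthma", "asthma"],
})

quasi_identifiers = ["age_band", "zip3"]   # assumed quasi-identifiers
sensitive = "disease"                      # assumed sensitive attribute

# Equivalence classes: records sharing identical quasi-identifier values.
classes = released.groupby(quasi_identifiers)

# k-anonymity: every equivalence class must contain at least k records.
k = classes.size().min()

# Distinct l-diversity: every class must contain at least l distinct
# sensitive values.
l_div = classes[sensitive].nunique().min()

print(f"table is {k}-anonymous and {l_div}-diverse")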

4. Task 3 – Scenario Dashboard
This task creates a Tableau dashboard consisting of four views of crime events from the LA crimes 2012-2016 data sets.
4.1 Crime Category
For a selected year and a selected crime, the image below shows the occurrences of that particular crime in each police department area.
4.2 Frequency of Occurrence
The frequency of occurrence of a selected crime is displayed in the figure below (Tahyudin, 2015).

4.3 Frequency of Crimes
The frequency of each crime, classified by police department area, is displayed in the figure below.
4.4 Geographical Presentation of Each Police Department Area
The geographical presentation of each police department area is shown in the figure below.
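The dashboard views above are built interactively in Tableau. The following sketch shows, in Python, the kind of aggregations that sit behind them: counts per crime category, the yearly frequency of a selected crime, and the frequency of each crime by police department area. The file name la_crimes_2012_2016.csv and the column names are assumptions about the LA crime extract, not the actual field names used in the dashboard.

# Hedged sketch of the aggregations behind the dashboard views. The file
# and column names are assumptions about the LA crimes 2012-2016 extract.
import pandas as pd

crimes = pd.read_csv("la_crimes_2012_2016.csv")

# View 1: occurrences of each crime category.
print(crimes["crime_category"].value_counts())

# View 2: yearly frequency of occurrence of a selected crime.
selected = crimes[crimes["crime_category"] == "BURGLARY"]
print(selected.groupby("year").size())

# Views 3 and 4: frequency of each crime by police department area (the
# geographical view plots the same counts on a map of the areas).
by_area = crimes.groupby(["area_name", "crime_category"]).size()
print(by_area.unstack(fill_value=0))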

The final dashboard is illustrated below.

References
Angra, S. and Ahuja, S. (2016). Analysis of student's data using Rapid Miner. Journal on Today's Ideas - Tomorrow's Technologies, 4(1), pp. 49-58.
Czernicki, B. (2010). Silverlight 4 Business Intelligence Software. New York: Apress.
K, T. and Wadhawa, M. (2016). Analysis and Comparison Study of Data Mining Algorithms Using Rapid Miner. International Journal of Computer Science, Engineering and Applications, 6(1), pp. 9-21.
Maheshwari, A. (2016). Business Intelligence and Data Mining.
Mercier, L. (2017). Tableau de Paris. Forgotten Books.
Sherif, A. (2016). Practical Business Intelligence. Packt Publishing.
Tahyudin, I. (2015). Data Mining (Comparison Between Manual and Rapid Miner Process). Saarbrücken: LAP LAMBERT Academic Publishing.
