This document discusses a business intelligence project on weather data using the Rapid Miner tool. It includes exploratory data analysis and decision tree and logistic regression models. It also covers research on high-level data warehouse architecture design and security concerns.
Table of Contents
1. Project Description
2. Task 1 - Rapid Miner
   2.1 Exploratory Data Analysis on Weather AUS Data
   2.2 Decision Tree Model
   2.3 Logistic Regression Model
   2.4 Final Decision Tree and Final Logistic Regression Models' Validation
       Performance Vector
3. Task 2 - Research the Relevant Literature
   3.1 Architecture Design of a High Level Data Warehouse
   3.2 Proposed High Level Data Warehouse Architecture Design's Main Components
   3.3 Security, Privacy and the Ethical Concerns
4. Task 3 – Scenario Dashboard
   4.1 Crime Category
   4.2 Frequency of Occurrence
   4.3 Frequency of Crimes
   4.4 Geographical Presentation of Each Police Department Area
References
1. Project Description
We shall be making use of the Rapid Miner tool for data preparation, data reading, data mining, information gathering, and data evaluation and analysis on a specific data platform. For this project we will be using the Australian weather data sets. We shall identify real-world issues and use business intelligence to solve complex organizational problems practically and creatively, applying business intelligence to the organization's systems and business processes. We have divided the entire project into three tasks.
Task 1 – Rapid Miner is the tool we shall be using for weather forecasting: predicting the next day's weather from today's weather conditions by analysing and evaluating the climate data sets collected by Australian researchers. For this we shall use the business understanding, data understanding, data preparation, modelling and evaluation phases of the CRISP-DM data mining process.
Task 2 – We shall study and evaluate a large amount of information collected from the research literature and use the tools and methods mentioned above to describe a data warehouse architecture and the material related to it.
Task 3 – A crime dataset will be used to create a dashboard.
The above topics and tasks will be evaluated, studied and discussed in detail.
2. Task 1 - Rapid Miner
For the first task, we shall apply the data understanding, data preparation, modelling and evaluation phases of the CRISP-DM data mining process to analyse the possibility of rainfall on the next day, predicting tomorrow's rainfall from the data collected on today's climatic conditions.
For this, the following steps will be carried out (Ahmed Sherif, 2016):
1. Exploratory data analysis on the Australian weather data
2. A decision tree model to be prepared
3. A logistic regression model to be built
4. The final decision tree model to be validated and its outcome checked
5. The final logistic regression model to be validated and its outcome checked
2.1 Exploratory Data Analysis on Weather AUS Data
By making use of the Rapid Miner tool, we shall evaluate the information in the Australian weather data. We will study every single factor in the data set that affects the outcome of the study, as well as the inter-relationships between these parameters. For each selected attribute we will also examine properties such as missing values, faulty data inputs, minimum and maximum values, the most frequent values, the standard deviation and more, and we shall see how these play a role in the analysis result. We shall start by applying the Rapid Miner tool to the exploratory analysis of the Australian weather data in the steps below.
As shown in the image below, open Rapid Miner and click on New Process.
The image below shows the data added to the created process.
The next image shows the specified data format.
Thus we have created a new process by successfully importing all the weather data into the system, as shown in the image below (Angra & Ahuja, 2016).
Descriptive statistics is the term used for the summary produced by the exploratory data analysis of the Australian weather data, and the image below shows it.
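As an aside, the snippet below is a minimal pandas sketch of the same descriptive-statistics step; it assumes the Australian weather data is available locally as a file named weatherAUS.csv with the attributes used later in this report (MinTemp, MaxTemp, Humidity3pm, RainToday, RainTomorrow and so on).

```python
# A minimal pandas sketch of the descriptive-statistics step, assuming the
# Australian weather data is stored locally as "weatherAUS.csv".
import pandas as pd

weather = pd.read_csv("weatherAUS.csv")

# Count of missing values per attribute (the "missing values" mentioned above).
print(weather.isna().sum())

# Minimum, maximum, mean and standard deviation for every numeric attribute.
print(weather.describe().T[["min", "max", "mean", "std"]])

# Most frequent value of each categorical attribute, e.g. RainToday.
print(weather.select_dtypes(include="object").describe().T["top"])
```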
Below is the image of the scatter plots for the weather attributes, produced by the exploratory data analysis of the gathered data sets and used for the prediction of rainfall tomorrow based on today's climatic and weather conditions.
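Purely as an illustration, a comparable scatter plot can be drawn outside Rapid Miner with matplotlib; the file name and the attribute pair below are assumptions based on the attributes used elsewhere in this report.

```python
# A small matplotlib sketch of one such scatter plot, assuming weatherAUS.csv
# from the previous step; attribute names follow the report.
import pandas as pd
import matplotlib.pyplot as plt

weather = pd.read_csv("weatherAUS.csv")
subset = weather[["Humidity3pm", "MaxTemp", "RainTomorrow"]].dropna()
colours = subset["RainTomorrow"].map({"No": "tab:blue", "Yes": "tab:red"})

plt.scatter(subset["Humidity3pm"], subset["MaxTemp"], c=colours, s=5, alpha=0.3)
plt.xlabel("Humidity3pm")
plt.ylabel("MaxTemp")
plt.title("Humidity at 3pm vs maximum temperature, coloured by RainTomorrow")
plt.show()
```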
The correlation between the attributes is also determined by the exploratory data analysis, and it is displayed in the image below (Czernicki, 2010).
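An equivalent correlation matrix can be computed with pandas, as the brief sketch below shows; it again assumes the weatherAUS.csv file and correlates only the numeric attributes.

```python
# A brief sketch of the attribute-correlation step on the numeric attributes
# of the assumed weatherAUS.csv file.
import pandas as pd

weather = pd.read_csv("weatherAUS.csv")
correlations = weather.corr(numeric_only=True)

# Pairwise linear correlation between every two numeric attributes.
print(correlations.round(2))
```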
The image below shows the scatter plots for these correlations. We have successfully identified the correlation values for each attribute.
2.2 Decision Tree Model
We shall now prepare a decision tree, "a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility", using the Rapid Miner tool. The decision tree will be applied on the given data set to get the desired result, in our case the prediction of
the rainfall for tomorrow based upon the weather conditions for today, in addition using the appropriate data mining operators on the data set. The decision tree's output is shown in the image below, and its description in parameters is given here:
Humidity3pm = ?
|   RainToday = NA
|   |   MinTemp = ?
|   |   |   Pressure9am = ?: NA {No=16, Yes=24, NA=1004}
|   |   |   Pressure9am > 1008.700: No {No=11, Yes=2, NA=4}
|   |   |   Pressure9am ≤ 1008.700: Yes {No=0, Yes=6, NA=2}
|   |   MinTemp > 6.800
|   |   |   MaxTemp = ?: NA {No=0, Yes=0, NA=6}
|   |   |   MaxTemp > 18.600: NA {No=15, Yes=7, NA=33}
|   |   |   MaxTemp ≤ 18.600: Yes {No=0, Yes=3, NA=1}
|   |   MinTemp ≤ 6.800: No {No=9, Yes=7, NA=1}
|   RainToday = No
|   |   MaxTemp = ?: No {No=108, Yes=8, NA=40}
|   |   MaxTemp > -0.150
|   |   |   MaxTemp > 7.600: No {No=1739, Yes=346, NA=33}
|   |   |   MaxTemp ≤ 7.600
|   |   |   |   MinTemp > -4.200
|   |   |   |   |   MinTemp > -3.600: Yes {No=12, Yes=20, NA=4}
|   |   |   |   |   MinTemp ≤ -3.600: NA {No=0, Yes=0, NA=2}
|   |   |   |   MinTemp ≤ -4.200: No {No=3, Yes=0, NA=0}
|   |   MaxTemp ≤ -0.150: Yes {No=1, Yes=2, NA=2}
|   RainToday = Yes: Yes {No=368, Yes=410, NA=42}
Humidity3pm > 83.500
|   Temp3pm > 36.400: No {No=8, Yes=0, NA=0}
|   Temp3pm ≤ 36.400: Yes {No=1711, Yes=7555, NA=223}
Humidity3pm ≤ 83.500: No {No=100492, Yes=21893, NA=2134}
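For readers without Rapid Miner, the following is a hedged scikit-learn sketch of the same kind of tree; it is not the Rapid Miner operator itself, only an equivalent model trained on the attributes that appear in the tree above, with missing values dropped for simplicity.

```python
# A hedged scikit-learn counterpart to the decision tree above; attribute names
# are taken from the tree output, the weatherAUS.csv file name is an assumption.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

weather = pd.read_csv("weatherAUS.csv")
features = ["Humidity3pm", "Temp3pm", "MinTemp", "MaxTemp", "Pressure9am", "RainToday"]
data = weather[features + ["RainTomorrow"]].dropna()

# Encode the categorical attribute RainToday as indicator columns.
X = pd.get_dummies(data[features], columns=["RainToday"])
y = data["RainTomorrow"]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```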
2.3 Logistic Regression Model
By making use of the Rapid Miner tool, the collected information on the present climatic conditions, and the proper data set values from the mining operators used in the project, we shall build a logistic regression model for the prediction of tomorrow's rainfall (k & Wadhawa, 2016).
The operators given below are used to build the logistic regression model:
1. Read CSV
2. Numerical to Polynominal
3. Polynominal by Binominal Classification
4. Binominal to Numerical
5. Logistic Regression
For predicting tomorrow's rainfall on the basis of today's weather conditions and the data mining sets, we shall use the logistic regression model; the odds ratio for each parameter outcome is taken from the model, and this value is used for the forecast.
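The chain of operators above is specific to Rapid Miner. Purely as an illustration, the sketch below builds a comparable model in scikit-learn and recovers the odds ratios mentioned in the text from the fitted coefficients; the attribute names are assumptions carried over from the weather data used earlier.

```python
# A rough scikit-learn counterpart to the operator chain above: read the CSV,
# recode the label as binominal, encode categoricals, fit logistic regression.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

weather = pd.read_csv("weatherAUS.csv")
features = ["Humidity3pm", "Pressure9am", "MinTemp", "MaxTemp", "RainToday"]
data = weather[features + ["RainTomorrow"]].dropna()

X = pd.get_dummies(data[features], columns=["RainToday"], drop_first=True)
y = (data["RainTomorrow"] == "Yes").astype(int)   # binominal label: 1 = rain tomorrow

model = LogisticRegression(max_iter=1000).fit(X, y)

# Odds ratio per attribute: how the odds of rain tomorrow change per unit increase.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.round(3))
```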
2.4 Final Decision Tree and Final Logistic Regression Models' Validation
The application of the model and performance operators is carried out here, where the validation of the decision tree and the logistic regression models is performed using cross validation. The image below shows the outcome (Maheshwari, 2016).
For the Decision Tree Model
Performance Vector
PerformanceVector:
accuracy: 80.55%
ConfusionMatrix:
True:    No        Yes      NA
No:      102370    22256    2212
Yes:     2092      7996     274
NA:      31        31       1045
kappa: 0.331
ConfusionMatrix:
True:    No        Yes      NA
No:      102370    22256    2212
Yes:     2092      7996     274
NA:      31        31       1045
For the Logistic Regression Model
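As a cross-check on the decision tree figures above, the short numpy sketch below re-derives the reported accuracy and kappa from the confusion matrix (rows are the predicted class, columns the true class, in the order No / Yes / NA).

```python
# Re-deriving accuracy and Cohen's kappa from the decision tree confusion matrix.
import numpy as np

confusion = np.array([
    [102370, 22256, 2212],   # predicted No
    [  2092,  7996,  274],   # predicted Yes
    [    31,    31, 1045],   # predicted NA
])

total = confusion.sum()
observed = np.trace(confusion) / total                           # observed agreement
expected = (confusion.sum(axis=1) @ confusion.sum(axis=0)) / total**2  # chance agreement
kappa = (observed - expected) / (1 - expected)

print(f"accuracy = {observed:.2%}, kappa = {kappa:.3f}")         # 80.55%, 0.331
```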
3. Task 2 - Research the Relevant Literature
3.1 Architecture Design of a High Level Data Warehouse
The image below displays the architecture design of a high level data warehouse.
3.2 Proposed High Level Data Warehouse Architecture Design's Main Components
Components of the Data Warehouse
Around the central information repository, which sits on an RDBMS server, a few selected elements are gathered to build an environment that controls, manages and operates the data warehouse. The main components of the data warehouse are as follows (Mercier, 2017).
Database of the Data Warehouse
The foundation of a data warehousing environment is the central database, on which the RDBMS technology is implemented. There is a constraint in using an older, traditional RDBMS for data warehousing, because such a system is optimized for transactional database processing. This affects the overall performance of the system, since aggregates, multi-table joins and ad-hoc queries are resource intensive.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
Copying data from one or more sources into a destination system can result in differences between the data on the two sides, while the loaded data should remain unchanged throughout. Migration and transformation tools are employed for this purpose: an ETL (Extract, Transform and Load) process is deployed to transform the data into a unified format, describing all the necessary modifications.
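As an illustration only, the pandas sketch below walks through a tiny extract-transform-load cycle; the file names, column names and currency conversion are hypothetical and merely stand in for the unified format the ETL tools described above would produce.

```python
# An illustrative extract-transform-load sketch; every name here is hypothetical.
import pandas as pd

# Extract: pull the same business entity from two differently structured sources.
sales_eu = pd.read_csv("sales_eu.csv")   # hypothetical source extract
sales_us = pd.read_csv("sales_us.csv")   # hypothetical source extract

# Transform: rename columns, convert units and types so both sides match.
sales_eu = sales_eu.rename(columns={"montant": "amount", "date_vente": "sale_date"})
sales_us = sales_us.rename(columns={"amount_usd": "amount"})
sales_us["amount"] = sales_us["amount"] * 0.92      # hypothetical currency conversion

unified = pd.concat(
    [sales_eu[["sale_date", "amount"]], sales_us[["sale_date", "amount"]]],
    ignore_index=True,
)
unified["sale_date"] = pd.to_datetime(unified["sale_date"])

# Load: write the cleaned, unified table into the warehouse staging area.
unified.to_parquet("warehouse/staging/sales.parquet", index=False)
```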
Metadata
Metadata, "data that provides information about other data", helps in creating, maintaining and supervising the data in the data warehouse. Many distinct types of metadata exist, such as structural metadata, reference metadata, administrative metadata, statistical metadata and descriptive metadata. Playing a crucial role, metadata specifies the usage, source, parameters, variables and features of the data within the data warehouse architecture; it shapes the way the data is managed, processed and used in the data warehouse and by the other mining tools.
Query Tools
One of the main objectives of data warehousing is to support important, strategic decisions, and for this we use query tools. A query is a request for data or information from a database table or from the data warehouse; query tools act as a mediator between the users and the administrators. The four varieties of these tools are:
1. Application development tools
2. Query and reporting tools
3. OLAP tools
4. Data mining tools
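Purely to illustrate the query-and-reporting idea, the sketch below answers a small aggregate question against a hypothetical warehouse table, with Python's built-in sqlite3 module standing in for a real warehouse query tool.

```python
# A toy reporting query against a hypothetical in-memory warehouse table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", 2016, 120.0), ("North", 2017, 150.0), ("South", 2017, 90.0)])

# The kind of aggregate, ad-hoc question a reporting tool would issue.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```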
3.3 Security, Privacy and the Ethical Concerns
For every system administrator, keeping the system safe and secure is always a tough task. With the amount of data passing within the system and also outside it, it becomes more and more important to have a strongly secured environment. The security system should not put the vast amount of vital information in the system at risk, while at the same time it should not complicate things by introducing overly complex security tools. All of this shall be addressed and analysed in this part of the project. We should ask the following questions related to the safety and security of the system:
What are the privacy and security prospects of the system, and how is the private information of individuals collected and used within the system?
Who has the authority to access and use such information, and who holds the controlling privileges for it?
What is the information policy of the system regarding how much information should be shared and how much can be used?
Who has the authority to stop this information from being shared or used in the system?
Overall, with the internet being easily available to everyone, private data, user information and privacy become very important issues. For all the above questions, we shall carry out a proper, detailed study of the technology and the tools available for securing data and valuable assets, and the same will be applied to the following:
Use of methodology, technology and guidance
Disclosure and unauthorized access
Modification/disruption
Destruction, recording and inspection
Privacy vs. security
The main priority and focus will be given to the data privacy policy, its use and misuse, the governance of individual data and the use of security-related tools. Focus will especially be on the gathering, misuse and sharing of an individual's data. The most worrying problem is the security of the system and how to safeguard this vital data, especially from outside threats and from data being stolen and used for malicious activities.
Big data privacy in the data generation phase
Data generation depends upon the type of data and how it comes into the system, and on this basis we have two varieties: active data generation and passive data generation. When the data comes from the data owner's online activity, such as browsing, and the user is unaware that his browsing history is being stored by a third party, it is called passive data generation. When the user gives this information voluntarily, for example by filling in a form, we call it active data generation.
Big data privacy in the data storage phase
For saving vast amounts of information we have advances in data storage and collection methodologies such as cloud storage technology, which becomes a very alarming situation if the data storage system is compromised, as this information can then be misused by anyone. Gathering various datasets from many data centres, as in a distributed environment, is another situation where the safety and security risk is high and the data can be vulnerable. Database-level information storage methods, file-level data storage techniques, application-level encryption schemes, and media-level security schemes are the four main divisions of the conventional approach to storage security.
Big data privacy preserving in data processing
Big data processing patterns are categorized into batch, stream, graph and machine learning methodologies. For the overall protection of the data and the privacy issues in the data processing part, we can make a further classification into two phases. More emphasis and importance has to be given to the first phase, data collection, as it may involve data from users and individuals, and safeguarding the information from unsolicited disclosure becomes vital. In the second phase, the objective is to extract meaningful data and resources from the collected data in such a way that the privacy of the user is not violated. After discussing the traditional ways in which vast amounts of data were secured and used, we shall now look into modern and new methods for the same, giving adequate privacy to the suitable and varied areas.
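As a minimal illustration of the application-level encryption schemes mentioned above, the sketch below uses the cryptography package's Fernet recipe: data is encrypted before it reaches the storage layer, so a compromised data store exposes only ciphertext. The record content and the key handling are simplified assumptions.

```python
# Application-level encryption sketch: encrypt a record before it is stored.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, kept in a key-management service
cipher = Fernet(key)

record = b"user=jane.doe;browsing_history=..."   # illustrative record only
ciphertext = cipher.encrypt(record)  # what actually gets written to storage
print(cipher.decrypt(ciphertext))    # only holders of the key can read it back
```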
De-identification
Generalization and sanitization of the data should be carried out, using suppression methods, so that the user's privacy and data are safe and secure before the data mining procedure starts. As a very important and high-level tool for privacy protection in big data analytics and for preserving individuals' own data in information mining, we will use the old and proven method of de-identification. To reduce the risks of re-identification and to improve the traditional privacy-preserving data mining methods, we shall bring the concepts of k-anonymity, l-diversity and t-closeness into the system. As we know from past records and observations, re-identification is a threat in big data, because it is easier to gather additional outside data to assist de-anonymization. Thus, de-identification alone is not sufficient to secure an individual's private data across the entire system. Privacy preservation faces a big challenge when either the flexibility problems, which include effectiveness, or the risks of re-identification are not handled properly. De-identification could still be highly practicable for protecting individuals in big data analytics if effective privacy-preserving tools are developed to help mitigate the re-identification risks. K-anonymity, l-diversity and t-closeness are the three methods of de-identification for privacy preservation. In this field of privacy, these approaches use some common terms:
Identifier attributes contain data such as full name, driver licence and social security number that are unique to and vary across users.
Quasi-identifier attributes refer to a set of information, for instance age, gender, zip code and birthdate, that could be combined with other external data to re-identify individuals.
Sensitive attributes refer to an individual's own exclusive data, for instance salary, sickness and so on.
Insensitive attributes refer to information that is general and innocuous.
Equivalence classes refer to sets of records that contain identical data on the quasi-identifiers.
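A toy pandas sketch of the generalization and suppression step described above, using the attribute categories just defined; the table and column names are illustrative only.

```python
# Suppress direct identifiers and generalize quasi-identifiers before mining.
import pandas as pd

people = pd.DataFrame({
    "full_name": ["Ann Lee", "Bob Roy", "Cal Kim"],     # identifier attribute
    "zip_code":  ["30301", "30305", "30312"],           # quasi-identifier
    "age":       [34, 37, 52],                          # quasi-identifier
    "illness":   ["flu", "asthma", "flu"],              # sensitive attribute
})

deidentified = people.drop(columns=["full_name"])                      # suppression
deidentified["zip_code"] = deidentified["zip_code"].str[:3] + "**"     # generalization
deidentified["age"] = pd.cut(deidentified["age"], bins=[0, 40, 60, 120],
                             labels=["<=40", "41-60", ">60"])
print(deidentified)
```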
K-anonymity
A release of information is said to have the k-anonymity property if every single individual in the release cannot be distinguished from at least k-1 other individuals whose details also appear in the release. In the context of k-anonymization problems, the database is a table comprising n rows and m columns, where every row denotes a record relating to a specific individual from a population and the entries in different rows need not be unique; the columns contain the values of the attributes associated with those individuals.
L-diversity
L-diversity is a form of group-based anonymization, developed and used for preserving the privacy of data sets by reducing the granularity of the data representation; this reduction is a trade-off in which some data that is valuable for information management or data mining algorithms is lost in order to gain some privacy. The reduction of granularity is carried out using methods such as generalization and suppression, so that any record maps onto at least k distinct records in the data; in this sense the l-diversity model (distinct, entropy and recursive variants) is an extension of the k-anonymity model. The l-diversity model addresses some of the weaknesses of the k-anonymity model, especially those where protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values that were generalized or suppressed, particularly when the sensitive values within a group exhibit homogeneity. Depending on the range of sensitive attributes, the l-diversity model adds the promotion of intra-group diversity of sensitive values to the anonymization mechanism. The overall security will be better, even though there may be issues in the evaluation, such as the fictitious data that has to be inserted if we wish to make data l-diverse when a sensitive attribute does not contain many distinct values. As the l-diversity method is exposed to skewness and similarity attacks, it is not able to fully prevent attribute disclosure.
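The small pandas sketch below checks k-anonymity and (distinct) l-diversity for a table: every equivalence class on the quasi-identifiers must contain at least k records and at least l distinct sensitive values. The table and column names are illustrative only.

```python
# Measure k-anonymity and distinct l-diversity of a toy de-identified table.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest equivalence-class size over the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

def l_diversity(df: pd.DataFrame, quasi_identifiers: list, sensitive: str) -> int:
    """Smallest number of distinct sensitive values within any equivalence class."""
    return int(df.groupby(quasi_identifiers)[sensitive].nunique().min())

table = pd.DataFrame({
    "age_band": ["<=40", "<=40", "<=40", ">40", ">40"],
    "zip3":     ["303**", "303**", "303**", "304**", "304**"],
    "illness":  ["flu", "asthma", "flu", "cancer", "cancer"],
})

print("k =", k_anonymity(table, ["age_band", "zip3"]))               # 2-anonymous
print("l =", l_diversity(table, ["age_band", "zip3"], "illness"))    # 1-diverse: weak
```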
4. Task 3 – Scenario Dashboard
In this task, a Tableau dashboard consisting of four crime event views is created from the LA crimes 2012-2016 data sets.
4.1 Crime Category
The image below shows, for each particular crime, the specific year and the specific crime occurring in each police department area.
4.2 Frequency of Occurrence
A selected crime and its frequency of occurrence are displayed in the figure below (Tahyudin, 2015).
4.3 Frequency of Crimes
Each crime and its frequency, classified by police department area, is displayed in the figure below.
4.4 Geographical Presentation of Each Police Department Area
The geographical presentation of each police department area is shown in the figure below.
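Although the dashboard itself is built in Tableau, the aggregation behind the two frequency views above can be sketched in pandas as below; the file name, the column names (Crime Category, Area Name, Year) and the example category are assumptions about the LA crimes 2012-2016 extract, not the actual Tableau fields.

```python
# Aggregations that mirror the frequency views of the crime dashboard.
import pandas as pd

crimes = pd.read_csv("la_crimes_2012_2016.csv")   # hypothetical extract

# Frequency of occurrence for a selected crime, year by year.
per_year = crimes[crimes["Crime Category"] == "BURGLARY"].groupby("Year").size()

# Frequency of each crime broken down by police department area.
by_area = (crimes.groupby(["Area Name", "Crime Category"])
                 .size()
                 .rename("count")
                 .reset_index())

print(per_year)
print(by_area.head())
```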
The final dashboard is illustrated below.
References
Ahmed Sherif (2016). Practical Business Intelligence. Packt Publishing.
Angra, S. and Ahuja, S. (2016). Analysis of student's data using rapid miner. Journal on Today's Ideas - Tomorrow's Technologies, 4(1), pp. 49-58.
Czernicki, B. (2010). Silverlight 4 Business Intelligence Software. New York: Apress L.P.
k, T. and Wadhawa, M. (2016). Analysis and Comparison Study of Data Mining Algorithms Using Rapid Miner. International Journal of Computer Science, Engineering and Applications, 6(1), pp. 9-21.
Maheshwari, A. (2016). Business Intelligence and Data Mining.
Mercier, L. (2017). Tableau de Paris. [Place of publication not identified]: Forgotten Books.
Tahyudin, I. (2015). Data Mining (Comparison Between Manual and Rapid Miner Process). Saarbrücken: LAP LAMBERT Academic Publishing.