Introduction to Data Analytics: Classifier Project and Analysis Report

Table of Contents
Introduction
Data mining
Data pre-processing and transformations
About the problem
Classification techniques
Actual classifier
Conclusion
References
Introduction
Data mining is concerned with discovering new and meaningful information in large volumes of data. It is the process of finding anomalies, patterns and correlations within large data sets, with the help of software tools, in order to predict outcomes. The collected data can be evaluated systematically to identify relationships and to answer existing problems. For organisations and businesses, data mining is used to discover trends, patterns and relations in the selected data so that better business decisions can be made. It can help to identify sales trends, develop smarter marketing campaigns and predict customer loyalty more accurately, thereby moving the company closer to its objectives. The tasks involved in data mining include selective data collection, storage, filtering and sorting, processing, and trend identification, and for all of these activities data mining relies on sophisticated algorithms and mathematical functions. The first of the three broad stages of the data mining process is the collection of the raw data, which is then sorted, stripped of unrequired data and modified, with new variables introduced where necessary; sorting the data in this way keeps the subsequent analysis simple. The important parameters and nodes, and their interactions, are then identified using graphical evaluation and statistical procedures. This leads to the most valuable and vital step, data pre-processing. The final project outcome will be severely affected if the data collection method is incorrect, if there are errors in the data values, or if undefined parameters are used; data pre-processing is therefore essential for ensuring the quality of the data and of the collection method. Selecting the important attributes during the training phase is far more strenuous if little attention is paid to irrelevant, noisy or unreliable data. Data preparation can take considerable time, since it involves many activities, including normalisation, cleaning, modification, and feature extraction and selection, all of which are needed to prepare the final training set for the data mining procedure.
Data mining
Data mining is a procedure used by firms to convert raw data into usable information. By using software tools to search for patterns in large batches of data, businesses can learn more about their customers, develop better and more effective marketing strategies, reduce overheads and increase sales. To increase revenue, reduce costs, improve customer relationships and minimise risk, data mining draws on tools such as artificial intelligence (AI), machine learning and predictive modelling. Data mining researchers recommend that the procedure follow the same path, consisting of six important steps:
1. Business understanding
The business understanding phase consists of first understanding the business requirements and, at the same time, the objectives, so as to form a clear picture of the business goals (Gilmore, Hofmann-Wellenhof and Soyer, 2010). Next, vital factors such as assumptions, resources and constraints should be identified in order to make a full appraisal of the situation. Data mining targets are then set, within the current situation, to achieve the business goals. Finally, a detailed plan is produced that covers both the data mining goals and the business goals.
2. Data understanding
The data is collected from various data sources, so getting to know the data starts with an initial data collection phase. For this collection to succeed, a few vital activities must be carried out, including data collection, data loading and data integration. The collected data is then studied carefully and reported on, in particular its "gross" or "surface" properties. Using querying, reporting and visualisation, the data is explored to answer the questions raised by the data mining methods; answering these vital questions improves the quality of the data mining results.
3. Data preparation
Data preparation typically takes up almost 90% of the project's total time, and the final outcome of this stage is the dataset. The data is cleaned, constructed, modified and formatted according to the project requirements and the desired form. On the basis of the business understanding, the dataset may also be explored in greater depth during this stage.
4. Modelling
The most suitable modelling methods are chosen first and applied to the prepared data set. To validate the quality of each model, a test scenario must be created and examined, and more than one model may be prepared for the scenario as required. To confirm that the selected models actually meet the business objectives, they then have to be evaluated thoroughly with the involvement of the stakeholders.
5. Evaluation
In the evaluation phase the results of the models are first analysed in the context of the business objectives. New business requirements may be raised at this stage, either because of new trends identified in the model outcomes or because of other factors; gaining business understanding is part of the iterative data mining procedure. Before moving to deployment, the final step, a decision has to be made as to whether the data set and models are final or whether further modifications are required.
6. Deployment
Stakeholders should be able to make use of the results and outcomes of the data mining process whenever they need them. Depending on the business needs, the deployment phase may be as simple as producing a report or as complicated as establishing a repeatable data mining procedure across the organisation. A plan for deployment, maintenance and monitoring is made and used to support the implementation and its future maintenance. Finally, a summary and review of the overall project experience identifies what needs to be improved and captures the lessons learned.
Data pre-processing and transformations
Data Pre-processing Methods
Data pre-processing is a data mining procedure that transforms raw data into a well-defined, usable format. Real-world data is often inconsistent or incomplete, lacks certain behaviours or trends, and contains errors and faults. Data pre-processing is a well-proven procedure for resolving such issues, and, by making raw data presentable, usable and manageable, it is a vital step in the data mining procedure (Richards and Davies, 2012).
Data pre-processing methods are divided into the following categories:
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Cleaning
Whatever the type of data, data quality is vital: outdated, inconsistent or inaccurate data can have a decisive impact on the outcomes. Data cleaning, also called data cleansing, is the procedure for making sure that the data is correct, consistent and usable by locating errors or corruption in the data and correcting, deleting or manually processing them as needed to prevent the errors from recurring. The manual part of the process is what can make data cleaning an overwhelming task: while much of data cleaning can be done by software, it must be monitored and inconsistencies reviewed.
With the advent of software and algorithmic tools, the process has become simpler. Even so, problems still arise during data collection, such as errors, wrong collection methods, incorrect data formats and columns in the wrong order. Depending on the situation, further difficulties may include enriching the data with additional data on the fly, modifying the schema and detecting errors. Some of the causes of data mismatches are:
Human error
Aging (data such as contact information degrades over time)
Omissions due to optional fields in forms, or merge errors.
Many vital tasks, such as record matching, data standardisation, de-duplication and data profiling, form part of the data cleaning procedure and provide the necessary solution. Even though data cleaning is recognised by most industry scholars as the most vital cog in quality data management, some errors will still creep into the data at some point in the procedure, and these have to be reduced to manageable levels. The errors must be understood and proactive steps taken so that they can be recognised and suitable measures applied to accommodate them in the final outcome. As the client and customer base in the system grows, this becomes increasingly important, and the organisation and its staff should take steps to recognise these errors.
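The tasks above can be illustrated with a short, hedged pandas sketch; the column names and values are invented purely for illustration and do not come from the project dataset.

```python
import pandas as pd

# Hypothetical raw records illustrating typical quality problems.
raw = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Roy"],
    "country": ["uk ", "uk ", "UK"],
    "signup_date": ["2021-05-01", "2021-05-01", "not recorded"],
})

# Record matching / de-duplication: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Standardisation: trim whitespace and unify the case of a text field.
clean["country"] = clean["country"].str.strip().str.upper()

# Parse dates; malformed entries become missing values for later review.
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")

# Data profiling: count the missing values that remain per column.
print(clean.isna().sum())
```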
Missing Values: If a large number of tuples have no recorded value for several attributes, the missing information can be filled in for each attribute by one of the procedures given below (a short imputation sketch follows the list):
1. Ignore the tuple: This method is usually applied when the class label is missing (assuming the mining task involves classification or description). It is not very effective unless the tuple contains several attributes with missing values, and it gives very poor results when the percentage of missing values per attribute increases considerably.
2. Fill in the missing value manually: In general, when a large amount of data with many missing values is involved, this method takes too much time and is not feasible.
3. Use a global constant to fill in the missing value: All missing attribute values are replaced by the same constant, such as a label like "Unknown" or -∞. If "Unknown" is used as the replacement for every missing value, however, the mining program may mistakenly conclude that these records form an interesting concept, since they all share the same value. Although it looks very easy, this methodology is therefore not advisable.
4. Use the attribute mean to fill in all the missing values.
5. Use the attribute mean computed over all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value, obtained with inference-based tools such as a Bayesian formalism or decision tree induction. The filled-in value may contain some error, since methods 3 to 6 all bias the data; method 6 is nevertheless the most popular strategy, because compared with the other methods it uses the most information from the current data to predict the missing values.
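As a concrete illustration of methods 4 and 6, the following hedged scikit-learn sketch fills missing values first with the attribute mean and then with a model-based estimate; the small numeric matrix is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Invented numeric data with missing entries (np.nan).
X = np.array([[25.0, 50000.0],
              [32.0, np.nan],
              [np.nan, 61000.0],
              [41.0, 72000.0]])

# Method 4: replace each missing value with the attribute (column) mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Method 6 (model-based stand-in): predict each missing value from the
# other attributes, analogous to the inference-based tools described above.
model_filled = IterativeImputer(random_state=0).fit_transform(X)

print(mean_filled)
print(model_filled)
```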
Noisy Data: Noise is a random error or variance in a measured variable. How can a numeric attribute such as "price" be "smoothed" to remove the noise? The data smoothing techniques are as follows:
1. Binning methods: Binning smooths a sorted data value by consulting its "neighbourhood", that is, the values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighbouring values, they perform local smoothing (a sketch of smoothing by bin means is given after this list).
2. Clustering: Similar values are grouped together into "clusters", and values falling outside the clusters can be detected as outliers.
3. Computer and human inspection: Outliers may be discovered through a combination of computer and human inspection. In one application, an information-theoretic measure was used to help recognise outlier patterns in a handwritten character database for classification: the measured value reflects the "surprise" content of the predicted character label with respect to the known label. The patterns can either be garbage (mislabelled characters) or important data (e.g. useful data exceptions). Patterns whose surprise content exceeds a threshold are output to a list, which can then be filtered to identify the actual unwanted ones. Searching the complete database manually would be very time consuming, and these filtered patterns avoid that; the garbage patterns can later be removed from the training database.
4. Regression: Data can be smoothed by fitting the data to a function, as with regression. Linear regression finds the "best" line to fit two variables, so that one variable can be used to predict the other. When more than two variables are involved, multiple linear regression can be used, in which the data is fitted to a multidimensional surface. Using regression to find a mathematical equation that fits the data is an effective way of smoothing out noise.
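The binning method (item 1) can be sketched as follows; the price values are invented, and equal-frequency binning with smoothing by bin means is assumed, since the text does not fix a particular binning scheme.

```python
import pandas as pd

# Invented, already-sorted price values.
prices = pd.Series(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Distribute the sorted values into 3 equal-frequency bins ("buckets").
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: every value is replaced by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())
```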
Inconsistent data: For some transactions there may be inconsistencies in the recorded data, which can be corrected manually using external references (e.g. data entry errors may be corrected by performing a paper trace). This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools can also be used to detect violations of known data constraints (e.g. known functional dependencies between attributes can be used to find values that contradict the functional constraints).
Data Integration
Data integration is a procedure in which heterogeneous data is collected and combined into an integrated form and structure. It allows different data types (such as data sets, documents and tables) to be merged by users, organisations and applications for use in personal or business processes and functions. Data integration essentially supports the analytical processing of huge data sets by aligning, combining and presenting each data set from organisational departments and external remote sources so as to satisfy the integrator's goals. The methodology is normally used in data warehouses (DW) via specialised software that hosts large data repositories from internal and external resources; the software extracts the data, amalgamates it and finally presents it in a unified form. For example, a user's complete data set may include data extracted from marketing, sales and operations, combined to produce a complete report. In a typical data integration method, the client's request for data is sent to the master server, which retrieves the needed data from internal and external sources. The data is extracted from the sources, consolidated into a single cohesive data set and sent back to the user.
The user data collected across all the servers is already huge and will keep growing as more devices and usage become common. Collecting the data is important, but filtering and sorting this raw data so that only the most important and usable data is retained is even more vital, and this is the challenge in gathering data of potential value. Metadata is another factor present in databases and data warehouses; it helps to avoid errors in schema integration. Redundancy is a further important issue: an attribute may be redundant if it can be derived from another table, as with annual revenue, for example.
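As a small, hedged illustration of consolidating data from two departmental sources into one cohesive data set, the following pandas sketch merges invented sales and marketing tables on a shared key; the table and column names are hypothetical.

```python
import pandas as pd

# Two invented departmental sources sharing a customer_id key.
sales = pd.DataFrame({"customer_id": [1, 2, 3],
                      "total_sales": [1200, 450, 800]})
marketing = pd.DataFrame({"customer_id": [1, 2, 4],
                          "campaign": ["spring", "spring", "autumn"]})

# Schema integration: align both sources on the shared key and combine
# them into one consolidated data set for the requesting user.
combined = sales.merge(marketing, on="customer_id", how="outer")
print(combined)
```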
Data Transformation
Data, as information, is very important for the day-to-day operations of every organisation. Unfortunately, much of this data can also accumulate on servers as redundant and inconsistent data. To make it more meaningful, usable and valuable to the organisation, it is vital to combine these information sources and to leverage existing IT assets to create more flexible, agile enterprise systems; data transformation is the key here. The features of data transformation include:
1. Smoothing, which helps to eliminate excessive noise from the data using methods such as clustering, binning and regression.
2. Normalization, in which the attribute data is scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0 to 1.0 (see the sketch after this list).
3. Aggregation, in which the data is subjected to summary or totalling operations (e.g. the daily sales data may be added up to compute monthly or annual totals). This methodology is mainly used when building a data cube for evaluating the data at various granularities.
4. Generalization of the data, in which lower-level or 'raw' data is replaced by higher-level concepts through the use of concept hierarchies (e.g. categorical attributes such as street can be generalised to higher-level concepts such as city or county). In the same way, numeric attributes such as age can be mapped to higher-level concepts such as young, middle-aged and senior.
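A minimal sketch of feature 2, min-max normalization, is shown below using scikit-learn; the age values and the target range of 0 to 1 are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented numeric attribute values (one column: age).
ages = np.array([[18.0], [35.0], [52.0], [70.0]])

# Scale the attribute so every value falls within the range 0 to 1.
scaler = MinMaxScaler(feature_range=(0.0, 1.0))
scaled = scaler.fit_transform(ages)
print(scaled.ravel())
```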
Data Reduction
Complicated data evaluation and mining on large amounts of data can take a long time, sometimes making the analysis impractical or infeasible. Data reduction is the process of reducing the amount of capacity required to store data; it can increase storage efficiency and reduce costs. Data reduction techniques produce a reduced representation of the dataset that is much smaller in volume yet preserves the integrity of the original data and still yields the same knowledge when analysed. This can be done by reducing the data volume or the number of data attributes. The outcome is equivalent to that obtained from the original data and attributes, but the subsequent mining is simpler and takes much less time. Separating and partitioning the data tuples can likewise allow the mining to work on less data while giving the same analytical outcome.
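One common way to reduce the number of attributes while retaining most of the information is principal component analysis; the following hedged sketch uses synthetic data as a stand-in for a real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 100 records with 10 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Keep only 3 derived attributes (principal components).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained per component
```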
About the problem
The task is to classify and predict the final "Final_Y" attribute. Various pre-processing steps and transformations can be applied (e.g. grouping attribute values, converting them to binary, etc.), with an explanation of why each was selected. The training set is divided into training and validation sets so that the classifier's quality can be analysed more accurately on a held-out set of examples. The workflow begins with the details of documents downloaded from PubMed, which are parsed and stored as a data table, while the corresponding data is also created as drop files in the drop directories. The documents are divided into two sets and two categories, the division being based on the category assignments (Richards and Davies, 2012). The textual data is pre-processed using several different filters and a stemmer node, and document vectors, which are numerical representations based on the extracted keywords, are created from these documents. For the classification we use k-nearest neighbour, support vector machine and decision tree classifiers.
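A hedged sketch of the training/validation split described above is given below; the file name "dataset.csv" is a placeholder, and the only assumption is that the loaded table contains the class attribute Final_Y.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder path for the prepared dataset with a "Final_Y" class column.
data = pd.read_csv("dataset.csv")
X = data.drop(columns=["Final_Y"])
y = data["Final_Y"]

# Hold back 30% of the records as a validation set for assessing classifier
# quality; stratify so both sets keep the original class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```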
Classification techniques
Logistic Regression
Logistic regression is a machine learning algorithm for classification. Modelled using a logistic function, it estimates the probabilities describing the possible outcomes of a single trial.
Naive Bayes
Naive Bayes applies Bayes' theorem under the assumption of independence between every pair of features. It works well in many real-world situations, such as document classification and spam filtering (Sarwar, 2017).
Stochastic Gradient Descent
Stochastic gradient descent is a simple and very useful approach for fitting linear models. It is particularly effective when the number of samples is very large, and it supports various loss functions and penalties for classification.
K-Nearest Neighbours
K-nearest neighbours is a type of lazy learning, since it does not attempt to construct a general internal model but simply stores instances of the training data (Stoltzfus, 2011). The classification of each point is computed from a simple majority vote of its k nearest neighbours.
Decision Tree
A decision tree produces a sequence of rules that can be used to classify the data, given the attribute data together with its classes.
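The candidate classifiers listed above can be compared on a common dataset with a few lines of scikit-learn; the sketch below uses default parameter settings and a synthetic dataset as a stand-in for the project data, so the scores are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the project's prepared dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Stochastic gradient descent": SGDClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Fit each classifier on the training split and report validation accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_val, y_val):.3f}")
```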
Actual classifier
Decision Tree
A decision tree is a type of supervised machine learning model (one in which the training data specifies the inputs and the corresponding outputs), in which the data is continuously split according to the selected predictor variables. The decision tree makes a sequence of hierarchical decisions on the result variable, based on the predictor data, in order to arrive at an answer. It is a non-parametric procedure, meaning there are no underlying assumptions about the distribution of the errors or the data; the model is built directly from the observed data. Classification trees are decision tree models in which the target variable takes a discrete set of class values (TOFAN, 2014). In a decision tree, the branches represent conjunctions of features leading to class labels, and the leaf nodes represent the class labels themselves; the classification rules are the paths from root to leaf. Tree-based learning algorithms are among the best and most widely used supervised learning methods, offering high accuracy, stability and ease of interpretation, and they empower predictive models. Decision trees in which the target variable takes continuous numerical values are called regression trees. CART (Classification and Regression Tree) is the term used to describe these two types of decision tree, i.e. regression and classification trees, together, and the CART model can be represented as a directed acyclic graph.
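A hedged sketch of fitting a classification tree to predict Final_Y is given below. The file path, the 70/30 split and the parameter settings (Gini criterion, maximum depth of 5) are illustrative assumptions rather than the project's exact configuration, and the features are assumed to be numeric.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import classification_report

# Placeholder path; assumed to contain numeric features plus the class column.
data = pd.read_csv("dataset.csv")
X = data.drop(columns=["Final_Y"])
y = data["Final_Y"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Illustrative parameter settings, not the project's exact configuration.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
tree.fit(X_train, y_train)

# Evaluate on the held-out validation set.
print(classification_report(y_val, tree.predict(X_val)))

# Each root-to-leaf path in the printed tree is one classification rule.
print(export_text(tree, feature_names=list(X.columns)))
```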