Data Mining Project: Implementation and Analysis with Weka

Data Mining in Weka
Name of Author
Class
Professor Name
Date
WEKA DATA-MINING STEP BY STEP
a. Data Load-up
Weka can load data in several formats, but in the Explorer it is done in one place: the Open file option on the Preprocess tab, as shown in the screenshot below.
After the dataset (which was in CSV format) has been loaded with Open file, it appears as in the illustration below.
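For readers who prefer to script this step, the same load can be done with Weka's Java API. This is a minimal sketch; the file name data.csv is a placeholder for the actual dataset.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // CSVLoader is the programmatic equivalent of Open file on the Preprocess tab
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv")); // placeholder path
        Instances data = loader.getDataSet();
        System.out.println("Loaded " + data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}
```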
Going through the attribute names in the uploaded file to inspect their characteristics, it is evident that some attributes have missing values. In addition, all the attributes were numeric. Missing values in a dataset cause difficulties during analysis and jeopardize the results. The irrelevant attributes can be deleted, and the missing values can be filled in.
There are two ways of deleting the irrelevant attributes that have been identified. One is through the filter function, but here the focus is on the simplest way: tick the boxes to the left of the attributes to be deleted, then click the Remove button at the bottom to delete them all at once. The process is shown below.
In the figure above, the ticked checkboxes and the Remove button at the bottom are what allow the attributes to be deleted all at once.
To fill the missing values and to change the attribute types from numeric to nominal, the Filter subsection of the Preprocess tab is needed:
- Under the unsupervised attribute filters, ReplaceMissingValues is selected and applied; it fills each missing value with the mean (for numeric attributes) or the mode (for nominal attributes) of the attribute in question.
- To change the attribute types there is only one approach: under the unsupervised attribute filters, NumericToNominal is selected. When it is applied, all the attributes are converted from numeric to nominal so they can be used in the analysis.
- After the change from numeric to nominal, duplicate instances can be removed. The RemoveDuplicates filter under the unsupervised instance filters is selected, and applying it from the Preprocess tab deletes the duplicates.
The result is a cleaned dataset with no missing values, except for the class attribute of the last 100 instances, which is left empty because those instances are the ones to be predicted. The whole cleanup can also be scripted, as sketched below.
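A minimal sketch with Weka's filter classes follows. It is illustrative only: the attribute indices passed to Remove are hypothetical and depend on which attributes were judged irrelevant, and the class is assumed to be the last attribute.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class Preprocess {
    public static Instances clean(Instances data) throws Exception {
        data.setClassIndex(data.numAttributes() - 1); // assumption: class is last

        // Delete irrelevant attributes (the indices "1,3" are hypothetical examples)
        Remove remove = new Remove();
        remove.setAttributeIndices("1,3");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Fill missing values with the attribute mean (numeric) or mode (nominal);
        // the class attribute is skipped, so the empty labels of the last 100
        // instances stay empty for prediction
        ReplaceMissingValues missing = new ReplaceMissingValues();
        missing.setInputFormat(data);
        data = Filter.useFilter(data, missing);

        // Convert every numeric attribute to nominal
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first-last");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);

        // Drop duplicate instances
        RemoveDuplicates dedup = new RemoveDuplicates();
        dedup.setInputFormat(data);
        return Filter.useFilter(data, dedup);
    }
}
```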
b. Saving and Splitting of the Data
After the dataset has been loaded and cleaned, it needs to be split into a training set, a test set, and a validation set. The data was loaded as a CSV file. The training portion is the first part of the file, consisting of 1000 instances. The part that is to be split into the test and validation sets is the last part, consisting of the 100 instances whose Class attribute is empty. Step by step, the three datasets are produced as follows:
- The last 100 instances with an empty Class are deleted so that the first 1000 instances are retained as the training dataset. Once extracted from the whole dataset, this subset is saved as the training data and loaded later when the machine learning models are trained.
- The second and third datasets, the test and validation sets, come from the set of the last 100 instances. Here the Resample filter under the unsupervised instance filters is chosen to split this subset in a 50% ratio between the test and validation sets. After Resample has been selected from the filters section, and before applying it, one clicks on the Resample entry; the window that appears is shown below.
In this window, noReplacement is set to True and the sample size percentage to 50% to produce the test set. For the validation set the same settings are used, but invertSelection is set to True, which selects the complementary 50% of the instances. (The debug option only controls console output and plays no part in the split.)
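A sketch of the same split in code, assuming unknown holds the 100 unlabeled instances; noReplacement together with a shared random seed makes the two halves complementary.

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class SplitData {
    public static Instances[] split(Instances unknown) throws Exception {
        // Test set: 50% sampled without replacement
        Resample testSampler = new Resample();
        testSampler.setRandomSeed(1);
        testSampler.setSampleSizePercent(50);
        testSampler.setNoReplacement(true);
        testSampler.setInputFormat(unknown);
        Instances test = Filter.useFilter(unknown, testSampler);

        // Validation set: the complementary 50% (same seed, invertSelection = true)
        Resample valSampler = new Resample();
        valSampler.setRandomSeed(1);
        valSampler.setSampleSizePercent(50);
        valSampler.setNoReplacement(true);
        valSampler.setInvertSelection(true);
        valSampler.setInputFormat(unknown);
        Instances validation = Filter.useFilter(unknown, valSampler);

        return new Instances[] { test, validation };
    }
}
```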
Once everything is set, one can move on to the machine learning section.
c. Machine Learning
This is the final part, where the three algorithms are developed. The models are built under the Classify tab. Simply evaluating on the training data would not be sound; instead, 10-fold cross-validation is run on the training dataset. The provided dataset also has a serious issue: it is imbalanced, and this imbalance needs to be addressed in order to obtain the best results.
As shown in the main report, of the models compared, the one noted to give the same results whether the data is balanced or imbalanced is the decision tree (J48).
For the normal development of a trained model, one goes to the Classify tab and clicks Choose. This is done only after the required dataset has been loaded. Training the models on the imbalanced dataset is done only to show the difference this makes in the overall results.
As in the screenshot above, clicking Choose at the top lists all the available machine learning algorithms. From this list, J48, Naïve Bayes, and KNN (which is referred to as IBk and sits under the lazy group of algorithms) are chosen. A sketch of the equivalent evaluation in code is given below.
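A sketch of the 10-fold cross-validation in code, assuming train is the 1000-instance training set with the class as the last attribute; k = 3 for IBk is an arbitrary example value.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareModels {
    public static void evaluateAll(Instances train) throws Exception {
        train.setClassIndex(train.numAttributes() - 1); // assumption: class is last
        Classifier[] models = { new J48(), new NaiveBayes(), new IBk(3) };
        for (Classifier model : models) {
            // 10-fold cross-validation, as run from the Classify tab
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(model, train, 10, new Random(1));
            System.out.println(model.getClass().getSimpleName());
            System.out.println(eval.toSummaryString());
        }
    }
}
```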
For the balanced-data runs, a meta classifier is needed: the CostSensitiveClassifier. Whenever its entry is clicked, the window shown below pops up. In its classifier field, one passes the base classifier that is to be trained.
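A sketch of the cost-sensitive wrapper around J48 follows; the 2x2 cost matrix is hypothetical, penalizing one kind of misclassification five times more heavily, and the actual costs have to be tuned to the dataset's imbalance.

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BalancedModel {
    public static CostSensitiveClassifier build(Instances train) throws Exception {
        train.setClassIndex(train.numAttributes() - 1); // assumption: class is last

        // Hypothetical 2x2 cost matrix: rows = actual class, columns = predicted.
        // Misclassifying class 1 as class 0 costs 5.0; the reverse costs 1.0.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0);
        costs.setCell(1, 0, 5.0);

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48()); // the base classifier passed in the classifier field
        csc.setCostMatrix(costs);
        csc.buildClassifier(train);
        return csc;
    }
}
```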
Clicking Start builds the model, and the test set is then supplied to produce the required predictions, which can be saved as a CSV file, as sketched below.
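A sketch of producing and saving those predictions, assuming model is a trained classifier (for example the cost-sensitive J48 above) and predictions.csv is a placeholder output name.

```java
import java.io.File;

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.converters.CSVSaver;

public class Predict {
    public static void predictAndSave(Classifier model, Instances test) throws Exception {
        test.setClassIndex(test.numAttributes() - 1); // assumption: class is last
        for (int i = 0; i < test.numInstances(); i++) {
            // Fill each empty class label with the model's prediction
            double label = model.classifyInstance(test.instance(i));
            test.instance(i).setClassValue(label);
        }
        CSVSaver saver = new CSVSaver();
        saver.setInstances(test);
        saver.setFile(new File("predictions.csv")); // placeholder output path
        saver.writeBatch();
    }
}
```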