Data Mining with Weka: Methodology, Analysis, and Predictive Modeling

Data Mining in Weka
Name of Author
Class
Professor Name
Date

Summary
Weka requires no programming, which is one of the reasons it is so well liked as a data mining toolkit. It gives people without programming skills a platform on which they can carry out machine learning work and keep pace with students who use tools that demand coding knowledge. Weka offers a range of facilities, starting with data cleaning, where both instances and attributes can be edited or deleted. This makes it straightforward to tidy up a dataset before any further analysis or development of machine learning algorithms. Attribute types can also be converted among numeric, string and nominal forms (Alam, Farhad, and Sanjay 2017). Such conversions matter because attributes must be presented in a form that most machine learning algorithms will accept. One major finding is that Weka does not produce classification results on a class variable that is numeric; it only does so when the class attribute is nominal. In practice, Weka works best with nominal data variables. Its ready handling of nominal attributes keeps it ahead of many other analytical tools, since it offers better visualizations and summary statistics that are easy to interpret (Alcala-Fdez et al. 2016). The scatter plots are clear, and classifiers built on nominal attributes produce well-organized confusion matrices with detailed counts. The confusion matrix is reported after the detailed summary statistics, which include the correctly classified instances and the Kappa statistic, both of great importance when interpreting the results (Asri et al. 2016). The figures that matter less
in a classification run are the error measures, since these mainly come into play when building a regression model such as a linear regression (Arganda-Carreras et al. 2017). Classifying a nominal variable also yields a clear breakdown of the true positives, the false negatives, and the F-measure, which is derived from precision and recall. The ROC area reports the area under the receiver operating characteristic curve for the classifier, and a curve that rises steeply toward the top left indicates a model that fits the data well (Bahari, Femina, and Sudheep 2015).
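These measures can also be read off programmatically through Weka's Java API; a minimal sketch, assuming an ARFF training file with a nominal class in the last position and J48 as an illustrative classifier:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsDemo {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file; the last attribute is assumed to be the nominal class.
        Instances train = DataSource.read("train.arff");   // illustrative file name
        train.setClassIndex(train.numAttributes() - 1);

        // 10-fold cross-validation of a J48 decision tree.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(tree, train, 10, new Random(1));

        // The same figures discussed above: accuracy, kappa, per-class
        // precision/recall/F-measure/ROC area, and the confusion matrix.
        System.out.println(eval.toSummaryString());        // correctly classified %, kappa, errors
        System.out.println(eval.toClassDetailsString());   // TP rate, FP rate, precision, recall, F, ROC
        System.out.println(eval.toMatrixString());         // confusion matrix
    }
}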
The dataset provided turned out to be quite dirty and, like most raw datasets, contained irrelevant variables. As in any task facing a data analyst or data scientist, the first job was therefore to clean the dataset so that it holds only the most relevant variables, making the analysis easier and the results more reliable.
Weka grew out of a research project spanning more than twenty years at the University of Waikato, and the authors who write about it have not disappointed. There are plenty of books, both online and in print, that serve as references when using Weka, and YouTube videos make it easy to learn how to build algorithms quickly without having to hunt for expensive textbooks. This varied pool of material makes Weka flexible and easy to adopt (Baid, Palak, Apoorva, and Neelam 2017), from learning institutions to organizations that rely on data mining and machine learning. In learning institutions, Weka is used to carry out the project analyses set by lecturers for students; in organizations, which are in most cases profit-driven,
machine learning models are built to help sustain the profits the company expects to earn. In profit-driven companies the most used machine learning technique is classification, which in practice means a binary or multi-class model that groups products or customers so that they can be understood better, acted on correctly, and turned into the decisions that drive the required profit growth (Bravo-Marquez et al. 2019). Take the telecom example, where loyal and non-loyal customers are labelled through the churn variable in a customer churn dataset. Firms use such a dataset to learn how to classify their customer base: once they know which customers are loyal and which are likely to churn, they can take measures to retain the ones at risk. By retaining customers who would otherwise leave, the company saves the money it would have spent winning new customers to replace those who churned (Brooks et al. 2016).
With the current trend in technology, machine learning and data mining have shifted analysis away from pen and paper and onto machines. All of this gives a clear view of how tools such as Weka can help, especially those who lack programming skills.

Methodology
i. Data Preparation
To begin, the data is loaded into Weka in CSV format, since the dataset to be analysed is supplied as a CSV file, a format Weka accepts directly. To load it, go to the Preprocess tab, choose Open file, select the dataset from its folder, specify the file type, and click Open. The data is then loaded and ready for preprocessing.
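The same load-up can be done through Weka's Java API instead of the Explorer; a minimal sketch, where the file name data.csv is an illustrative assumption:

import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadCsv {
    public static void main(String[] args) throws Exception {
        // CSVLoader mirrors "Open file..." with the CSV file type selected.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));   // illustrative file name
        Instances data = loader.getDataSet();

        // The dataset is now in memory and ready for preprocessing.
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes.");
    }
}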
The main task of this assignment is to build classification machine learning models. Classification, as noted above, requires a nominal class rather than a numeric one, so before going any further the dataset has to be preprocessed and reduced to the attributes that will actually be used for the analysis and for building the models (Bunker, Rory and Fadi 2017).
One way to decide which variables to delete, especially when building machine learning pipelines in the R software, is to fit a linear regression model. The model summary assigns fewer significance stars to variables that contribute little or no statistical significance, which flags the attributes that add least to the models being developed (Chandrasekar et al. 2017). In such a regression, the variables judged to have little or no statistical significance are usually the nominal ones; their confidence levels are very low, so they need to be taken out or deleted. Another option is to convert those variables into numeric form to improve their significance, but the value this
adds is far lower, so deletion is the better option in this case (Chaurasia, Vikas, and Saurabh 2016).
Since the work is done in Weka, the variables that need deleting can be inspected through the Edit window, reached from the Preprocess tab. It lays out the data and shows which attributes are nominal and which are numeric. Not every numeric variable will stay either: attributes that are largely empty fail the fill-up criteria and must be deleted regardless. From the Edit view, the nominal attributes to delete are att1 through att12, plus att13, which is nominal and also has many empty cells, and att14. The numeric attributes to delete are ID and att19. ID adds absolutely nothing to the models to be developed; its only purpose is to number the instances, and since Weka keeps its own numbering, the ID column is redundant and is removed. Attribute att19 is almost entirely empty and no fill-up method can repair it, so it is eliminated as well. The deletion procedure itself is described in the paragraph that follows.
In the Preprocess tab, click Choose under Filter and select the unsupervised attribute filters, which include the Remove filter used to delete irrelevant attributes from a dataset. Click the Remove text that appears next to the Choose button and, in the weka.gui.GenericObjectEditor window, enter the range of attributes to delete in the attributeIndices field. Click OK and then Apply, and the selected attributes disappear. If an attribute is deleted by mistake, one can
simply click Undo to reverse the whole step and make a new selection (Choudhury, Sumouli, and Anirban 2015). There is also a simpler way to delete irrelevant attributes: tick the boxes to the left of the attributes of interest and click the Remove button at the bottom of the attribute list. In this work the simpler option is used, and after the deletion only the numeric attributes remain, which the Edit window confirms.
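The same removal can be scripted through Weka's Java API rather than the GUI; a minimal sketch, in which the file name and the 1-based attribute index list are only illustrations of the ID, att1-att14 and att19 deletions described above (the real indices depend on the column order of the file):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.csv");   // illustrative file name

        // Remove corresponds to Choose -> filters -> unsupervised -> attribute -> Remove.
        Remove remove = new Remove();
        // Illustrative 1-based index range; in the GUI this is the attributeIndices field.
        remove.setAttributeIndices("1-15,20");
        remove.setInputFormat(data);

        Instances cleaned = Filter.useFilter(data, remove);
        System.out.println("Attributes remaining: " + cleaned.numAttributes());
    }
}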
A closer look at the cleaned data shows an empty cell under the att28 attribute at instance 18 (by Weka's numbering), which means the remaining missing entries such as this one still have to be filled.
Further examination shows that att17 takes the value 1.0 throughout, so its standard deviation is zero and the attribute has no variability at all; it must therefore be deleted as well, cleaning the data even further. From there the dataset can be converted from CSV to ARFF format. The conversion is simple: the data that was loaded as a CSV file is saved again as an ARFF file directly from the Preprocess tab, choosing the destination folder at that point (Fatima, Meherwar, and Maruf 2017).
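The CSV-to-ARFF conversion can equally be done in code with Weka's ArffSaver; a minimal sketch, with the file names chosen purely for illustration:

import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveAsArff {
    public static void main(String[] args) throws Exception {
        // Load the cleaned CSV produced by the earlier preprocessing steps.
        Instances cleaned = DataSource.read("cleaned.csv");   // illustrative file name

        // ArffSaver mirrors "Save..." with the ARFF file type in the Explorer.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(cleaned);
        saver.setFile(new File("cleaned.arff"));              // illustrative file name
        saver.writeBatch();
    }
}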
a. Missing Values
In data science, data mining and machine learning, what gives most data scientists headaches is dirty data. Data therefore has to be wrangled in whichever software is used, and one such step is filling the missing values in an attribute. Deleting irrelevant attributes, as above, is one part of this data wrangling, cleaning, or preprocessing (Flannery et al. 2015); here the focus is on filling the empty cells. There are options for doing this, since either the mean or the median can be used for the fill-ups (Feurer et al. 2015). To find out which variables have missing values, one checks the descriptive statistics for each variable by selecting it and reading the selected-attribute panel. Doing so shows that all variables are complete except Class, att25 and att28. The Class attribute has 100 missing values, which amounts to 9% of its instances. Attribute att25 has 3 missing values, below 1% of the total.
The third and last variable with missing entries is att28, with 6 missing values, about 1% of the recorded instances.
To replace missing values, click the Choose button under the Filter panel, pick filters, then unsupervised, and finally select the ReplaceMissingValues filter (or ReplaceMissingWithUserConstant when a fixed constant is preferred) after selecting a variable that has missing values (Flannery et al. 2015). After it is applied, the missing counts drop to zero, which indicates that the gaps have all been filled. The same can be done for the other attributes until all of them are complete. In this assignment, however, only att25 and att28 are filled: the Class attribute, whose last 100 instances are the ones to be predicted, is left unfilled, since those 100 values will be predicted during model testing and therefore form the test set (Dwivedi et al. 2016).
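The fill-up can also be scripted with the unsupervised ReplaceMissingValues filter, which substitutes the mean for numeric attributes and the mode for nominal ones; a minimal sketch, with the file name as an illustrative assumption:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FillMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cleaned.arff");   // illustrative file name
        // With a class index set, the filter skips the class attribute by default,
        // which suits this assignment: the 100 unlabeled Class values stay empty.
        data.setClassIndex(data.numAttributes() - 1);

        // Numeric attributes (such as att25 and att28) are filled with the column
        // mean; nominal attributes would be filled with the column mode.
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(data);
        Instances filled = Filter.useFilter(data, fill);

        System.out.println("Instances after filling missing values: " + filled.numInstances());
    }
}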
Since the model building and testing process requires the last 100 instances to serve as the test dataset, the finally cleaned data is saved as a CSV file and then split before one part is loaded as the test set and the other as the training set. The cleaned dataset is loaded into Weka with the Class attribute still holding the 100 instances that are to be classified, so the Class attribute does not have its last one hundred (100) values filled. For completeness, the split into training and test sets is carried out in Weka itself, and the files included with the assignment submission contain both the training set, with 1000 instances, and the test set, with 100 instances in which the Class attribute is empty.
The clean split of the dataset into training and test sets is also done through the Filter panel in the Preprocess tab. There are two families of filters to choose from, supervised and unsupervised. From the unsupervised filters one selects the instance filters, since instances are what need manipulating when splitting a dataset into training and test sets (Dwivedi et al. 2016). To obtain the test set, the instances whose Class values are filled are deleted, leaving the ones with empty Class entries; the training set is produced the other way round. The training set was prepared from the CSV file of the cleaned dataset and saved under the same name, then loaded back into Weka and saved in ARFF format with its attributes converted to nominal (Jain et al. 2017). The reason all attributes are converted from numeric to nominal is that, in numeric form, most classification algorithms such as J48, Naïve Bayes and k-NN cannot be built; the only way to develop the models smoothly is to transform all the attributes into nominal variables (Kalmegh, Sushilkumar 2015).
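A scripted equivalent of the split and the numeric-to-nominal conversion is sketched below; the file names and the assumption that the 100 unlabeled cases sit at positions 1001-1100 of an 1100-instance file are illustrative, not taken from the assignment data:

import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.instance.RemoveRange;

public class SplitAndConvert {
    public static void main(String[] args) throws Exception {
        Instances all = DataSource.read("cleaned.arff");      // illustrative file name

        // Drop the last 100 rows -> training set.
        RemoveRange dropTest = new RemoveRange();
        dropTest.setInstancesIndices("1001-1100");            // illustrative range
        dropTest.setInputFormat(all);
        Instances train = Filter.useFilter(all, dropTest);

        // Keep only the last 100 rows -> test set (Class values are missing).
        RemoveRange keepTest = new RemoveRange();
        keepTest.setInstancesIndices("1001-1100");
        keepTest.setInvertSelection(true);                     // invert: retain only the listed range
        keepTest.setInputFormat(all);
        Instances test = Filter.useFilter(all, keepTest);

        // Convert attributes from numeric to nominal so that classifiers such as
        // J48, NaiveBayes and IBk (k-NN) treat the class as categorical; the test
        // set would be converted the same way to keep the headers compatible.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first-last");
        toNominal.setInputFormat(train);
        train = Filter.useFilter(train, toNominal);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(train);
        saver.setFile(new File("train.arff"));                 // illustrative file name
        saver.writeBatch();

        System.out.println("Train: " + train.numInstances() + ", Test: " + test.numInstances());
    }
}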
ii. Data Classification
Weka is a non-programming data mining and machine learning package developed at Waikato University to help less fortunate students. The group referred to here are the statistics or mathematics students who want their analytical work, the assignments and projects set in class, done in data analytics software, but who have little or no experience in developing data mining and machine learning algorithms on platforms that require coding knowledge such as Python and R (Koliopoulos et al.
2015). Such cases push an individual toward non-programming software such as Weka and RapidMiner. Weka has many functions and is easy to explore, with several windows to work in. Its libraries come already built and installed, up and running for analysis and for developing the required machine learning models; tasks are executed by clicking tabs, which is what makes non-programming machine learning software so approachable (Kotthoff et al. 2017).
In this section, classification algorithms are applied to the dataset provided in the assignment. The Class attribute has to be modelled with three different classification algorithms, which are then used to predict the last 100 cases. The major problem is the severe imbalance in the instances provided: left untreated, it skews the classification so that most or even all instances are assigned to one class, leaving the other class with no predictions at all. The imbalance can be addressed adequately within Weka itself (Kulkarni et al. 2016). There is a clear procedure for this through the meta option under the Classify section, which handles the imbalance by attaching a penalty whenever a classification algorithm makes many wrong classifications. The models learn that there is a penalty and tend toward the correct classifications (Lang et al. 2019).
One important point is that a model can report the highest classification percentage and still be the worst model. This happens when it assigns instances to one class, or a few classes, and ignores the rest, which leaves the summary results missing many of the descriptive statistics. For the best summary statistics during classification the dataset must therefore be balanced, and that is achieved through
the application of a penalty. The penalty can be applied to any machine learning model developed in Weka, so there is no need for different approaches to different datasets when building different algorithms. Once the balancing is applied, the percentage of correctly classified instances usually drops from a misleadingly high figure to a lower one, but the classification is sounder because every class is taken into account, unlike in the earlier, unbalanced case (Lausch et al. 2015).
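The penalty-based handling of imbalance described above corresponds to Weka's cost-sensitive meta-classifier; a minimal sketch, in which the file name, the choice of J48 as base learner and the cost weights are illustrative assumptions rather than settings from the assignment:

import java.util.Random;

import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");   // illustrative file name
        train.setClassIndex(train.numAttributes() - 1);

        // 2x2 cost matrix for a two-class problem: the off-diagonal cells penalize
        // the two kinds of misclassification. The weights 5.0 and 1.0 are
        // illustrative only and would be tuned to the actual class imbalance.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0);
        costs.setCell(1, 0, 1.0);

        // The meta classifier wraps any base learner (J48 here; NaiveBayes or
        // IBk could be substituted) and applies the penalty during training.
        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(csc, train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}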
In the classification work, two scenarios are examined: the imbalanced data and the balanced data. The results from the imbalanced case are kept for comparison with the balanced case, and it is the balanced results that are used for the final classification. The statistical summary produced includes the true positive rate, the false positive rate, precision, recall and the F-measure. The measures that describe the discriminating power of the model include the ROC area (Receiver Operating Characteristic area), the area under the curve obtained by sweeping the decision threshold from which the confusion matrix is built: the larger the area, the better the classification performance. At the top of the summary is the number and percentage of correctly classified instances; a higher percentage means more instances classified correctly, provided the dataset has been properly balanced. Some error figures are also reported, but these are only really meaningful in a linear regression setting, not in classification. The kappa statistic shows the degree to which the classification model makes the right predictions beyond chance; the closer its value is to one, the better the model, so the kappa statistic plays a very important role. Turning to the incorrectly classified instances, it is evident that the number that is incorrectly classified is