Object and Data Modelling: Weka and Data Mining Applications


Added on 2022/09/30

AI Summary

This project undertakes an in-depth exploration of object and data modelling, leveraging the capabilities of machine learning models within the Weka environment. The study begins with a performance evaluation of five classification models—MultilayerPerceptron, Naive Bayes, J48, RandomForest, and REP tree—across three distinct datasets: Iris, Breast Cancer, and Diabetes. The analysis involves descriptive statistics, graphical representations, and a comparative assessment of model accuracies, false positive rates, and key parameters like precision, recall, and ROC area. The project then transitions to data mining applications, defining its purposes, including revenue increase, customer relationship improvement, and risk reduction. The stages of data mining, encompassing data sources, exploration, modeling, and deployment, are outlined. Finally, it delves into real-world applications, particularly focusing on revenue enhancement and customer relations within the telecommunications sector, emphasizing the use of predictive models to mitigate customer churn. The project references various research papers to support its methodologies and findings.

Running Head: MODELLING 1
Object and Data Modelling
Name of Student
School

Classification Performance Evaluation (Task 1).
This part evaluates the performance of five machine learning models in Weka. Weka is machine
learning software from the University of Waikato that can be used without writing code, which
helps students who are not strong in coding. For example, a mathematics student may want
insight into the statistical relationships between variables and decide to use statistical
software. If the student lacks the coding skills required by platforms such as Python and R,
the student can always opt for non-coding software, of which Weka and RapidMiner are among the
best examples.
Three datasets are provided, and each is run through five classification models. The models'
performances are then compared to find the one with the highest accuracy. The models listed
for comparison are MultilayerPerceptron, Naive Bayes, the decision tree J48, RandomForest
and REPTree (Kotthoff, Thornton, Hoos, Hutter & Leyton-Brown, 2019).
1. Iris Dataset
We begin by loading the Iris dataset. According to the screenshot below, it has 150 instances
and 5 attributes to be considered for analysis. Figure 1 shows, on the right, the descriptive
statistics for the sepal length variable: the mean, standard deviation, minimum and maximum
values.
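The summary statistics named above can be reproduced outside Weka with a few lines of Python. The following is a minimal sketch, not the procedure used in this report: the function name `describe` and the sample values are hypothetical, and the sample standard deviation (dividing by n - 1) is assumed.

```python
import statistics


def describe(values):
    """Summarise one numeric attribute: min, max, mean, stdev, distinct count."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "stdev": statistics.stdev(values),
        # Number of different values the attribute takes.
        "distinct": len(set(values)),
    }


if __name__ == "__main__":
    # Hypothetical measurements for illustration only, not the real Iris column.
    sample = [5.1, 4.9, 4.7, 5.0, 5.1, 6.3, 5.8]
    for name, value in describe(sample).items():
        print(f"{name}: {value}")
```

Applying the same function to each numeric column of the dataset would give one such summary per attribute.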

figure 1.
figure 2.
Figure 2 gives the statistics for sepal width, which has 23 distinct values. The next
variable, petal length, is shown in figure 3.
figure 3.

figure 4.
Figure 4 focuses on petal width.
figure 5
Figure 5 illustrates the class attribute: each species has 50 instances out of the 150 total
instances.

figure 6.
The figure above shows the distributions of all the attributes.
Figure 7 shows the scatter plots, which give the pairwise relationships between the variables
used for analysis. The most important thing to note is that blue, red and green represent
Iris setosa, Iris versicolor and Iris virginica respectively. The scatter plots clearly show
distinct clusters, even where the variables overlap.

figure 7.

a. Multilayer Perceptron
b. Naïve Bayes

c. J48

d. Random Forest
e. REP tree

f. Models Comparison
This section reports the accuracy of each model, given by the percentage of correctly
classified instances. The values are 97%, 96%, 96%, 95% and 94% for Multilayer Perceptron,
Naïve Bayes, J48, Random Forest and REP tree respectively. On these percentages, Multilayer
Perceptron is the best performer and REP tree the poorest of the models. The smaller the FP
(False Positive) rate, the better the model, since it classifies more instances into their
correct classes. The models that perform better in terms of correctly classified instances
also have lower FP rates, so more instances are classified correctly according to the
confusion matrices (Bravo-Marquez, Frank, Pfahringer & Mohammad, 2019). The parameters that
are important to report are the precision, the recall, the F-measure and the area under the
ROC (Receiver Operating Characteristic) curve. The closer these values are to 1, the better
the classification model.
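To make the relationship between a confusion matrix and these measures concrete, the sketch below computes accuracy and the per-class precision, recall, FP rate and F-measure. It is an illustration under stated assumptions, not the output of any model in this report: the matrix values and the function names `accuracy` and `class_metrics` are hypothetical, and rows are assumed to hold the actual classes and columns the predicted classes. The ROC area is left out because it needs the predicted class probabilities, not just the matrix.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def class_metrics(matrix, k):
    """One-vs-rest precision, recall, FP rate and F-measure for class index k.

    Assumption: matrix[actual][predicted], i.e. rows are the actual classes.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                     # actual k, predicted other
    fp = sum(row[k] for row in matrix) - tp      # actual other, predicted k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix (50 instances per class),
    # chosen for illustration only.
    example = [
        [50, 0, 0],
        [0, 47, 3],
        [0, 2, 48],
    ]
    print("accuracy:", accuracy(example))
    for k in range(len(example)):
        print("class", k, class_metrics(example, k))
```

A model with fewer off-diagonal entries scores a higher accuracy and a lower FP rate at the same time, which is why the two measures move together in the comparison above.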
2. Breast Cancer
In this case, the same items examined for the Iris dataset are examined again, but without
detailed explanations, since the explanations in the previous section can be used for
follow-up.
figure 1
The figure above shows that there are a total of 286 instances.
figure 2
Figure 2 gives the distribution of age.
figure 3