Machine Learning on Health Tweets Case Study 2022
MACHINE LEARNING ON HEALTH TWEETS
NAME OF AUTHOR
NAME OF PROFESSOR
NAME OF CLASS
NAME OF SCHOOL
STATE AND CITY
DATE
ABSTRACT
This section gives a brief overview of what the machine learning report will cover. The report focuses on the analysis of a tweets dataset. An introductory part orients the reader to the data analytics literature that follows. The dataset is then described in detail, stating its attributes, the preprocessing stages applied, and the results of that preprocessing. The main data mining techniques are elaborated as well: a machine learning technique is chosen together with the reason it is being applied, and it is stated whether the algorithm is a classification or a clustering algorithm, a question that depends on whether the problem posed by the dataset is a classification or a clustering problem. In the evaluation and demonstration section, the developed model is compared with an alternative, and the reasons for preferring one over the other are stated clearly. The conclusion wraps up the whole report.
INTRODUCTION
A dataset was provided containing medical descriptions of what is and what is not good for health; in effect, health advice given through tweets. These were actual tweets, each accompanied by a URL link. The response variable here is the description variable, and this is what the classification models will be built on (Ryu and Moon, 2016). The kind of classification is therefore text mining classification. Two classification methods will be compared with one another on the same dataset, which will largely aid the evaluation and demonstration section in pointing out which one is the better classification model (Zhou, Tong, Gu and Gall, 2016).
In the 21st century, machine learning is widely used for classification and regression, and this aids the understanding of different datasets and the trends and patterns they display. To give an example from business: a telecommunication company may want to check which of its customers would churn from its services in the long run and which would not (Zainudin, Shamsuddin and Hasan, 2019). In this case, the variable of interest would be the churn feature column, as this is the column that records whether a customer stays loyal. When a classification model such as logistic regression, one of the simplest classification algorithms to understand, is fitted to the churn rate, customers are classified into those likely to churn and those likely to stay true to the company. After classification, the model can be tested using a confusion matrix to see which customers were classified correctly and which were classified wrongly. This helps the business allocate resources accordingly, since it is easier and cheaper to retain already acquired customers than to win new ones (Balcan, Sandholm and Vitercik, 2018).
The illustration in the paragraph above shows that machine learning, and classification in particular, aids the operations of a company and hence its profits. Machine learning is used widely across industries; in our case, we will focus on health data, specifically tweets about what is and what is not recommended for staying healthy. This indicates that machine learning can be used in health in different ways.
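The churn workflow described above can be sketched in a few lines. This is an illustrative sketch only (the report itself uses WEKA, not code), and the features and the churn rule here are entirely made up:

```python
# Toy churn classifier: logistic regression plus a confusion matrix.
# Features and the churn rule are hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two made-up features per customer (e.g. charges and tenure, standardized).
X = rng.normal(size=(200, 2))
# Made-up rule: churn when the first feature outweighs the second.
y = (X[:, 0] - X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes;
# the diagonal counts the correctly classified customers.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

Reading the matrix off-diagonal shows exactly which customers were classified wrongly, which is the resource-allocation insight described above.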
The aim of this assignment is to demonstrate one of the many ways in which machine learning, and more specifically the WEKA analytics tool, can aid the understanding of health data.
DATA SUMMARY AND PREPROCESSING
The dataset for the machine learning algorithms chosen in the next section was downloaded from the UCI machine learning repository. It came as a zip archive containing 16 text files, all of them health-related, recording responses and opinions on the health routes most people take, what they should avoid, and what they should pick in order to stay healthy (Keleş and Keleş, 2018).
From the 16 files, foxnewshealth.txt was chosen, and it had to be transformed into a CSV file for easier upload into WEKA. Upon transformation, several variables were created: ID, MONTH, DATE, DAY, TIME, YEAR, DESCRIPTION and finally URL. Since these were tweets, there was a URL link for every tweet sent. Given that the dataset was a health dataset, it is reasonable to believe that those tweeting were from the medical field.
After the CSV transformation, the dataset had to be loaded into WEKA, but at first this was not possible: a regular and thorough clean-up was needed to obtain a CSV file that the WEKA analytics tool could load. The clean-up involved deleting multiple rows coded in formats WEKA could not read. After loading the dataset into WEKA, the CSV file was transformed into an ARFF file, which is WEKA's native data file type, saved, and then loaded afresh into WEKA (Alcala-Fdez et al. 2016).
The Explorer tab was used for everything: exploration, preprocessing, and the examination of data size and data quality. After the clean-up, the dataset was left with 1936 cases and eight attributes. The ID attribute has 48 distinct values, and the remaining statistical description is as below:
Figure 1
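The clean-up step can be sketched as follows. This assumes the raw file uses the pipe-separated layout id|timestamp|tweet of the UCI "Health News in Twitter" files; the sample lines and the exact field layout are assumptions for illustration, not taken from the report:

```python
# Sketch: turning raw pipe-separated tweet lines into CSV rows,
# dropping rows in a format the WEKA loader could not read.
# The sample lines below are hypothetical.
import csv
import io

raw_lines = [
    "576760136541237248|Sat Mar 14 13:15:00 +0000 2015|"
    "Some health advice here http://fxn.ws/example",
    "bad line without the expected fields",  # gets dropped
]

rows = []
for line in raw_lines:
    parts = line.split("|", 2)
    if len(parts) != 3:  # delete rows coded in a non-loadable format
        continue
    tweet_id, timestamp, text = parts
    # Separate the tweet body (DESCRIPTION) from a trailing URL, if any.
    url = ""
    if " http" in text:
        text, _, tail = text.rpartition(" http")
        url = "http" + tail
    rows.append([tweet_id, timestamp, text, url])

out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

The resulting CSV can then be opened in WEKA and saved as ARFF from the Explorer.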
From Figure 1 we can see a listing of the mean, median, maximum and standard deviation, along with clear histograms for the different classes.
Figure 2
From Figure 2, it is evident which days contributed the most tweets to the dataset. Wednesday has the highest frequency, followed by Tuesday, Friday, Thursday, Monday, Saturday and finally Sunday. The lower number of tweets on Saturday and Sunday might be explained by the fact that these are days off work, when people are less inclined to visit social media platforms. Of the weekdays from Monday to Friday, Monday has the lowest count, which can be explained by the fact that most people are usually sluggish on Mondays.
Figure 3
From Figure 3 it is evident which month has the highest and which the lowest number of entries in the tweet data collection. November is the lowest, for the obvious reason that people are entering the holidays, take holidays seriously, and therefore tweet less.
The overall plot is shown in Figure 4 below:
Figure 4
From Figure 4, it is evident that some attributes produced no plots at all, because they were string variables that could not be plotted.
Figure 5
Figure 5 above shows scatter plots of the relations between different attributes, and the spread differs from case to case: some relations are linear, positively or negatively, some points spread over the entire graph area, and others occupy only the lower part of the graph.
DATA MINING TECHNIQUES
In this section we choose and apply data mining techniques of our own choice to satisfy the aim of the application. Before choosing which classification model to use in our machine learning project, we have to understand both classification and clustering. Clustering is an unsupervised set of machine learning algorithms: the algorithm is not told what to do at each step but learns on its own over time, so that when new data is supplied, it decides into which group to put an item (Kulkarni and Kulkarni, 2016). Take for instance a group of fruits where bananas, apples, mangoes and pineapples are mixed up without a plan. When a clustering algorithm is applied, it recognizes which category each of the mixed-up fruits falls under and groups them accordingly (Naik and Samant, 2016).
Types of clustering algorithms include partitioning methods, hierarchical clustering, fuzzy clustering, density-based clustering and model-based clustering (Mohsen et al. 2017).
Next we need an understanding of classification. Classification is a supervised set of algorithms in which models are guided by labelled data even after a model is developed. In these algorithms, training data is used to develop the model, after which test data is supplied; the model developed from the training data then predicts outcomes on the test data (Arganda-Carreras et al. 2017).
Types of classification machine learning algorithms include linear classifiers (logistic regression and the naive Bayes classifier), nearest neighbour, support vector machines, decision trees, boosted trees, random forests and neural networks (Kodati, Vivekanandam and Ravi, 2019).
The problem we will focus on is a classification problem. Two classification algorithms were to be chosen, and Decision Trees and Naïve Bayes were selected, the reason being to check which performs better in terms of accuracy. To understand what each algorithm stands for, it is wise to look into the pros and cons of each, as this gives a clearer picture of the two even before diving into the actual analysis produced when the dataset is run through the respective algorithms in WEKA.
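The fruit example above can be sketched with k-means, a partitioning clustering method. The measurements are invented for illustration; the point is only that an unsupervised algorithm groups similar items without being told the labels:

```python
# Sketch: k-means separating two made-up fruit groups
# by weight (g) and diameter (cm), with no labels supplied.
import numpy as np
from sklearn.cluster import KMeans

fruits = np.array([
    [120.0, 3.5], [115.0, 3.4], [125.0, 3.6],    # small, light items
    [900.0, 12.0], [950.0, 12.5], [880.0, 11.8], # large, heavy items
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(fruits)
print(labels)
```

The algorithm assigns the first three items to one cluster and the last three to the other purely from the feature values, which is exactly the unsupervised behaviour described above.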
i. Advantages of a Decision Tree
a. It considers multiple consequences, giving you room to look at the data in multiple different ways rather than only one.
b. Decision trees are easy to understand, as the branches run from the root down to the leaves, forming a clear flow. Bulky trees can be pruned into simpler trees that are easier to understand, which can also result in better accuracy.
c. Transparency: the use of nodes to address uncertainties makes the end process clearer.
d. Simplicity: a major advantage of decision tree analysis is its ability to assign specific values to problems, decisions, and the outcomes of each decision. This reduces ambiguity in decision-making.
e. Ease of use: the graphical illustration a decision tree provides makes it easier for data scientists to draw inferences, as the tree branches from node to leaves and continues to grow.
f. Flexibility: unlike other decision-making tools that require comprehensive quantitative data, decision trees can handle items with a mixture of real-valued and categorical features, and items with some missing features. Once constructed, they classify new items quickly.
ii. Disadvantages
a. Wrongly placed decisions while developing a decision tree may result in a lack of contingencies, which can turn out to be detrimental to the analysis in the long run, as the expected results cannot be obtained. Serious assessment is therefore required to bring about accuracy in implementing decisions.
b. They are unstable: a small change in the data can lead to a large change in the structure of the tree.
c. They are relatively inaccurate, as many other algorithms, for example the random forest, perform better on similar data.
d. The calculation gets very complicated, particularly when many data points or attributes are strongly related.
Since decision trees will be compared with Naïve Bayes, it is important to introduce the latter here as well. Naïve Bayes is also a supervised machine learning algorithm and likewise depends on training and test datasets to make predictions (Gupta et al. 2017).
Advantages of Naïve Bayes
i. The model converges more quickly than other models such as logistic regression.
ii. It needs less training data (Mohammed, Ali and Hassan, 2019).
Disadvantages of Naïve Bayes
i. It cannot learn interactions between features. For instance, if your love for food and your love for movies are related, it cannot capture that at all (Jadhav and Channe, 2016).
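The planned comparison of the two classifiers can be sketched outside WEKA as well. This is a hedged, minimal sketch: the tweets and their "good"/"bad" advice labels below are invented, and scikit-learn stands in for WEKA's J48 and NaiveBayes implementations:

```python
# Sketch: naive Bayes vs. a decision tree on a tiny made-up tweet set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

tweets = [
    "eating vegetables improves heart health",
    "daily exercise keeps your heart healthy",
    "smoking damages your lungs",
    "too much sugar harms your health",
]
labels = ["good", "good", "bad", "bad"]  # hypothetical advice labels

X = CountVectorizer().fit_transform(tweets)  # bag-of-words features
nb = MultinomialNB().fit(X, labels)
dt = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Training accuracy of each model, for a rough side-by-side look.
print(nb.score(X, labels), dt.score(X, labels))
```

On real data the comparison would of course use a held-out test split rather than training accuracy, which is what the evaluation section below does in WEKA.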
EVALUATION AND DEMONSTRATION
In building the classification models, since we are using WEKA, we only need to point at specific tabs and choose the required options for each algorithm. This makes developing the models much easier than in R, Python and other code-based environments, where one would need to write code to develop machine learning algorithms. The two models stated above will be compared, and a ruling made on which one is better. More goes into declaring one model a better classifier than the other than the results deduced from model development alone: the compatibility of the dataset with the model must also be considered, since some datasets cannot easily be run on certain models. In our case, both Naïve Bayes and the Decision Tree can handle our dataset in the WEKA software (Smith and Frank, 2016).
In WEKA one can set a reference feature by going to the edit tab, right-clicking, and choosing "set as reference feature" on the feature in question. That feature automatically moves to the very last column; click OK and watch the variable move down in the view tab.
In our case URL is the variable of consideration, and since it is already the last variable, there is no need to set it as the response variable (Wong and Senthil, 2018).
The URL column, being the very last column in the list of attributes, is the reference variable for classification, since it is the attribute which, when used, gives the highest accuracy of all: the accuracy stands at 98% for the Decision Tree case, with 1824 correctly classified instances and 21 incorrectly classified instances.
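The accuracy WEKA reports is simply the share of correctly classified instances; checking the figures quoted above:

```python
# Accuracy = correct / (correct + incorrect), using the counts above.
correct, incorrect = 1824, 21
accuracy = correct / (correct + incorrect)
print(round(accuracy * 100, 1))  # 98.9
```

This matches the roughly 98% figure reported for the Decision Tree.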
The size of the tree is 1920 and the number of leaves in 1917. Time to build the model was up to
0.44seconds. Ignored cases that were not classified at all are 91 in total. The absolute accuracy is
quite low indicating a better model and it is at 0.00. All the results that have been realized is
actually from the use of the training dataset of the whole dataset. There were no folds used as the
results that are gotten when folds are used for the construction of a decision tree actually gives a
lower performance of the model developed. the accuracy of such only stands at 0.83% when
folds are used and what this translates to is that few instances are taken into account and more
than 1700 instances are ignored. This translates into high numbers in the errors that are realized.
(Habibi, Ahmadi and Alizadeh, 2015). For the pictorial view of the realized results, see the
figure below;
Moving on to the Naïve Bayes for text classification in WEKA, we will use the default split
ratio of 66% for train data and 34% for test data. The same response variable that was used for
the development of the decision tree, the URL, will also serve here. To set the reference variable for building classification models, one goes to the Edit window and clicks on the header of the variable column to be made the reference variable; that variable then takes the very last position in the attribute list. In our case we use the final variable as it stands, so no changes to the already prepared dataset are needed.
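The same column-reordering step can be mimicked programmatically. As a hedged sketch, assuming the data had been exported to a pandas DataFrame with a class column named "url" (a hypothetical name), one could move it to the last position, which is where WEKA classifiers expect the class attribute by default:

```python
# Sketch: put the chosen class column last, mirroring WEKA's convention
# that the final attribute is the class. Column names are hypothetical.
import pandas as pd

def move_class_to_end(df: pd.DataFrame, class_col: str) -> pd.DataFrame:
    """Return a copy of df with class_col as the last column."""
    others = [c for c in df.columns if c != class_col]
    return df[others + [class_col]]

tweets = pd.DataFrame({"text_len": [12], "url": ["yes"], "retweets": [3]})
tweets = move_class_to_end(tweets, "url")  # columns: text_len, retweets, url
```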
The Naïve Bayes model was built in a shorter time, exactly 0.01 seconds, while testing it on the test split took up to 0.56 seconds. The correctly classified instances total only 5, fewer than the number classified by the decision tree model, so the decision tree remains ahead on the number of instances classified. The accuracy of Naïve Bayes reaches 0.79%, short of that of the decision tree; so far the decision tree still proves better than the Naïve Bayes model. The total number of instances is 627 and the ignored class instances stand at 31. These are smaller numbers, and therefore the Decision Tree outperforms Naïve Bayes on every measure considered (Gull, Padhye and Jain, 2017). The actual output, in pictorial form, is as below;
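The 66/34 percentage-split evaluation used above can be sketched as follows. This is an illustrative scikit-learn analogue, not the WEKA run itself; `X` and `y` are hypothetical stand-ins for the vectorised tweets and the URL class column.

```python
# Sketch of WEKA's default 66% train / 34% test percentage split, using
# scikit-learn and a multinomial Naive Bayes text classifier.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def naive_bayes_split_eval(X, y, seed=0):
    """Train on 66% of the data and score on the held-out 34%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.34, random_state=seed)
    model = MultinomialNB().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```

Because a single split scores the model on only about a third of the instances, the counts it yields are naturally smaller than a full training-set evaluation, which is worth keeping in mind when comparing it with the decision-tree figures above.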
Moving to the actual evaluation of the two classification models, we use the Experimenter tab to load the two models together with the data; both models are then run to see which performs better. As it stands, there are many interruptions from Naïve Bayes, as opposed to the decision tree, which runs smoothly and shows the potential to handle large datasets without interruption. The screenshot of the performance evaluation is as per the diagram given below;
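An Experimenter-style head-to-head can be approximated by scoring both learners on identical cross-validation folds. The sketch below is a hypothetical scikit-learn analogue of that comparison, not the WEKA Experimenter itself:

```python
# Sketch: compare a decision tree and Naive Bayes on the same CV folds,
# roughly what the WEKA Experimenter tab automates.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def compare_models(X, y, folds=5, seed=0):
    """Return each model's mean accuracy over shared CV folds."""
    cv = KFold(n_splits=folds, shuffle=True, random_state=seed)
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "naive_bayes": MultinomialNB(),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```

Sharing the folds between the two learners is the point of the design: any difference in mean accuracy then reflects the models rather than the sampling.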
From the above figure, it is evident that the evaluation ran from start to finish and that there are no errors in either of the models being tested.
The performance evaluation process is as below;
The connectivity that leads to the ROC curve is as below;
The ROC curve is as below;
The closer the curve is to the upper left corner, that is, the higher the true-positive rate at a low false-positive rate, the better the model.
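For readers reproducing the curve outside WEKA, the ROC points can be computed from a classifier's predicted scores. The following is a small illustrative sketch with scikit-learn; the variable names are hypothetical.

```python
# Sketch: compute ROC points and the area under the curve (AUC) from
# true labels and predicted scores; WEKA derives the same curve from
# its threshold data.
from sklearn.metrics import auc, roc_curve

def roc_points(y_true, y_score):
    """Return (fpr, tpr, auc); a curve hugging the upper-left corner,
    i.e. high TPR at low FPR, means a better-ranking model."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)
```

An AUC near 1.0 corresponds to a curve pressed into the upper-left corner, while an AUC of 0.5 is the diagonal of random guessing.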
CONCLUSION
In the long run, there is clear evidence that the Decision Tree would overpower Naïve Bayes: there are more advantages on the side of the Decision Tree algorithm than on the side of Naïve Bayes, and Naïve Bayes therefore carries less weight (Bilal et al., 2016). Every dataset can be analyzed to make inferences for later use. The only
problem with analyzing text data is that it requires more specialized care than numerical data, where even the inferences can be deduced easily. Text data is harder to clean than numeric data.
Tweet classification, as realized through the development of these models, can readily be done per class and according to the classification rules of the chosen classification model.
References
Alcala-Fdez, J., Garcia, S., Fernandez, A., Luengo, J., Gonzalez, S., Saez, J.A., Triguero, I.,
Derrac, J., Lopez, V., Sanchez, L. and Herrera, F., 2016. Comparison of KEEL versus open
source Data Mining tools: Knime and Weka software.
Arganda-Carreras, I., Kaynig, V., Rueden, C., Eliceiri, K.W., Schindelin, J., Cardona, A. and
Sebastian Seung, H., 2017. Trainable Weka Segmentation: a machine learning tool for
microscopy pixel classification. Bioinformatics, 33(15), pp.2424-2426.
Balcan, M.F., Sandholm, T. and Vitercik, E., 2018, June. A general theory of sample complexity
for multi-item profit maximization. In Proceedings of the 2018 ACM Conference on Economics
and Computation (pp. 173-174). ACM.
Bilal, M., Israr, H., Shahid, M. and Khan, A., 2016. Sentiment classification of Roman-Urdu
opinions using Naïve Bayesian, Decision Tree and KNN classification techniques. Journal of
King Saud University-Computer and Information Sciences, 28(3), pp.330-344.
Gull, K., Padhye, S. and Jain, D.S., 2017. A Comparative Analysis of Lexical/NLP Method with
WEKA’s Bayes Classifier. International Journal on Recent and Innovation Trends in
Computing and Communication (IJRITCC), 5(2), pp.221-227.
Gupta, B., Rawat, A., Jain, A., Arora, A. and Dhami, N., 2017. Analysis of various decision tree
algorithms for classification in data mining. International Journal of Computer Applications,
163(8), pp.15-19.
Habibi, S., Ahmadi, M. and Alizadeh, S., 2015. Type 2 diabetes mellitus screening and risk
factors using decision tree: results of data mining. Global journal of health science, 7(5), p.304.
Jadhav, S.D. and Channe, H.P., 2016. Comparative study of K-NN, naive Bayes and decision
tree classification techniques. International Journal of Science and Research (IJSR), 5(1),
pp.1842-1845.
Keleş, A.E. and Keleş, M.K., 2018, April. Determination and Classification of Crew Productivity
with Data Mining Methods. In Data Mining (p. 35). BoD–Books on Demand.
Kodati, S., Vivekanandam, R. and Ravi, G., 2019. Comparative Analysis of Clustering
Algorithms with Heart Disease Datasets Using Data Mining Weka Tool. In Soft Computing and
Signal Processing (pp. 111-117). Springer, Singapore.
Kulkarni, E. and Kulkarni, R.B., 2016. Weka powerful tool in data mining. International
Journal of Computer Applications, 975, p.8887.
Mohammed, R.A., Ali, A.E. and Hassan, N.F., 2019. Advantages and Disadvantages of
Automatic Speaker Recognition Systems. Journal of Al-Qadisiyah for computer science and
mathematics, 11(3), pp.Comp-Page.
Mohsen, H., El-Dahshan, E.A., El-Horbaty, E.M. and Salem, A.M., 2017. Brain tumor type
classification based on support vector machine in magnetic resonance images. Annals Of
“Dunarea De Jos” University Of Galati, Mathematics, Physics, Theoretical mechanics, Fascicle
II, Year IX (XL), (1).
Naik, A. and Samant, L., 2016. Correlation review of classification algorithm using data mining
tool: WEKA, Rapidminer, Tanagra, Orange and Knime. Procedia Computer Science, 85, pp.662-
668.
Ryu, S.H. and Moon, H.J., 2016. Development of an occupancy prediction model using indoor
environmental data based on machine learning techniques. Building and Environment, 107, pp.1-
9.
Smith, T.C. and Frank, E., 2016. Introducing machine learning concepts with WEKA. In
Statistical genomics (pp. 353-378). Humana Press, New York, NY.
Wong, M.L. and Senthil, S., 2018, August. Applying Attribute Selection Algorithms in
Academic Performance Prediction. In International Conference on Intelligent Data
Communication Technologies and Internet of Things (pp. 694-701). Springer, Cham.
Zainudin, Z., Shamsuddin, S.M. and Hasan, S., 2019. Deep Learning for Image Processing in
WEKA Environment. International Journal of Advances in Soft Computing & Its Applications.
Zhou, Y., Tong, Y., Gu, R. and Gall, H., 2016. Combining text mining and data mining for bug
report classification. Journal of Software: Evolution and Process, 28(3), pp.150-176.