WEKA Text Analysis on Patients' Response on Health URLs, Data Analysis

Added on 2022/11/07

AI Summary

This report provides a detailed analysis of patient responses on health URLs using the WEKA software. It begins with an introduction to data analytics, machine learning, and the WEKA platform, emphasizing its user-friendly interface and popularity. The report then describes the dataset, including data preprocessing steps to convert text data into a CSV file suitable for WEKA, variable summaries, and visualizations. The core of the report focuses on data mining techniques, specifically classification and clustering, with an in-depth discussion of decision trees and Naive Bayes models. The report evaluates the decision tree models, highlighting their advantages and suitability for the analysis. The study explores how WEKA can be used to classify the trustworthiness of URLs based on patient feedback, illustrating a practical application of machine learning in healthcare. The report includes data visualization and the application of classification techniques to analyze patient feedback from tweets, aiming to classify URLs based on their trustworthiness.

WEKA 1
TEXT ANALYSIS ON PATIENTS’ RESPONSE ON WEKA
NAME OF AUTHOR
NAME OF PROFESSOR
NAME OF CLASS
STATE AND CITY
DATE

Executive Summary
The data are downloaded from the EDU website at the URL http://mlr.cs.umass.edu/ml/. The download is a
zip file whose contents are datasets in plain text format. From here, the analysis is done as required
and the visualizations are produced with the analysis tool used in this report. Before the analysis
section, an introductory part covers the analytics background, the motivation and the aim of the
exercise behind this report. After the introduction comes the data section, which gives a clear view of
the data processing steps, a summary of every variable, the visualizations of every variable, the data
size and the data quality. A further section explains the main data mining techniques employed to meet
the aim of the application: whether classification or clustering is used, the data mining algorithms
adopted to carry out the chosen technique, and a discussion of a counterpart algorithm that is also
viable for the analysis. The next section evaluates and demonstrates the algorithm models selected for
the chosen technique. The report ends with a conclusion.
Introduction of Data Analytics Background
The assignment behind this report relies on machine learning and a number of data mining
techniques. The software used is WEKA, which was developed by the University of Waikato. It has a rich
community, since students, professors and specialists are all involved in improving its features. This
has helped WEKA grow in popularity and favour among machine learning software. What also makes WEKA
highly interactive is that it does not require code to be run on its platform to obtain the results
being sought. The software allows tabs and machine learning algorithms to be chosen without much
hassle. Those who have no coding skills are therefore covered: they can choose between WEKA and
RapidMiner, since both are elaborate machine learning packages that do not require code to produce
results (Abusnaina, Abdullah and Kattan, 2015).
Machine learning is now widespread and is closely tied to big data, since large volumes of data are
what machine learning algorithms learn from. Machine learning is where the algorithms that help
computer systems operate on data are based. As the name suggests, it is a process by which a machine is
made to learn how to process the data it is given. Take, for example, credit card transactions and
email communication. Fraudulent transactions occur at different times among those that are not
fraudulent. Systems are therefore set up to detect fraudulent transactions and stop them at the first
point. This secures the company that handles the credit cards and protects the income of the companies
and people who receive payments. A good example is withdrawing money from PayPal. When someone starts a
withdrawal and then cancels it in order to raise the amount, the account is limited: the withdrawal is
delayed and may take several hours. This is likely to make someone with ill intentions towards an
individual's PayPal account give up, since the limit stays in place for some time. In the case of email
communication, emails are classified as spam or non-spam based on their content. Non-spam emails are
delivered to people's inboxes, whereas spam emails are sent to the spam folder.

In our case, we will use WEKA to carry out machine learning tasks that help classify which URLs give
genuine and truthful advice on patients' health. The feedback comes from patients who have visited
hospitals for treatment, mostly in the form of tweets that carry URLs. The data collected were
therefore the tweeter's ID, the date details of when each tweet was sent, the description of each
individual's response and, finally, the URL from which the tweet was sent. The classification is made
with regard to the URL, as some URLs are not trustworthy while others are very trustworthy.
Summary of Dataset
As downloaded from the website, the dataset is in text format and cannot be loaded into WEKA in
this form. It first has to be transformed into a CSV file. The dataset transformed into the CSV file
has eight variables in total: ID, DAY, MONTH, DATE, TIME, YEAR, DESCRIPTION and URL. The ID is the ID
of the person sending the tweet and the DAY variable is the day the tweet was sent. The URL and the
DESCRIPTION are the link that sent the tweet and the content of the tweet respectively. Upon download,
the dataset had several cases that prevented upload, because they contained noise that WEKA could not
handle, so it had to be cleaned. The clean-up discarded a good number of cases. After the clean-up, the
data were loaded into WEKA and converted into an ARFF file, the file format preferred by WEKA. This
file was exported from WEKA and re-imported, so the dataset finally loaded into WEKA was an ARFF file.
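The text-to-CSV step described above could be sketched roughly as follows. The report does not state the raw file layout, so this sketch assumes one tweet per line in the form `id|Day Mon DD HH:MM:SS +0000 YYYY|tweet text with a link`. The names `parse_line` and `convert` are hypothetical helpers, not part of WEKA. Lines that do not fit the assumed layout are dropped, which mirrors the clean-up of noisy cases.

```python
import csv
import io
import re

# Assumption: raw lines look like "id|Thu Apr 09 01:31:50 +0000 2015|text http://..."
URL_PATTERN = re.compile(r"https?://\S+")
FIELDS = ["ID", "DAY", "MONTH", "DATE", "TIME", "YEAR", "DESCRIPTION", "URL"]


def parse_line(line):
    """Return a dict with the eight variables, or None for a noisy line."""
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        return None
    tweet_id, stamp, text = parts
    stamp_parts = stamp.split()
    if len(stamp_parts) != 6:
        return None
    day, month, date, time, _offset, year = stamp_parts
    match = URL_PATTERN.search(text)
    if match is None:
        return None
    url = match.group(0)
    # The description is the tweet text with the links stripped out.
    description = URL_PATTERN.sub("", text).strip()
    values = [tweet_id, day, month, date, time, year, description, url]
    return dict(zip(FIELDS, values))


def convert(lines, out):
    """Write the parsable lines to `out` as CSV and return how many were kept."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    kept = 0
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    sample = [
        "1001|Thu Apr 09 01:31:50 +0000 2015|Example health tip http://example.com/a",
        "a noisy line that does not fit the layout",
    ]
    buffer = io.StringIO()
    print(convert(sample, buffer), "row(s) kept")
    print(buffer.getvalue())
```

The `csv` module quotes commas and quotation marks inside the tweet text, which is one common source of the upload failures that make manual clean-up necessary.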
The dataset had 1936 cases upon upload. The visualizations of the variables are as below:

The visualization of the ID variable above reports the minimum, maximum, mean and standard
deviation. The bars on the plot represent the frequency of each class of IDs.

The visualization of the age variable gives the same summary statistics as were given for the IDs.

This is the MONTH variable, shown as counts per month. March, January, February and December have the
highest numbers of observed tweets, as opposed to April and November with 125 and 35 respectively.

The above visualization shows the scatter plots for all the variables under study. Looking at the
scatter for different pairs of variables, some are evenly distributed across the graph area, some slant
linearly, either positively or negatively, and others are spread horizontally.
The overall preprocessing visualization is as below:

This shows the graphical representation of the variables. For some variables no plot can be made at
all: TIME, DESCRIPTION and URL. Their graphs are not displayed because there are too many distinct
values to display. Only two years are covered, 2014 and 2015. Far fewer tweets were collected for 2014
than for 2015. One possible explanation is access to the internet. Perhaps fewer people could get
online in 2014 because of a lack of funds and poor infrastructure, so that fewer people had smartphones
or home laptops to tweet from, and therefore fewer people tweeted in 2014 than in 2015.
Data Mining Techniques
The data mining techniques that can readily be used here are classification and clustering.
Classification is a supervised machine learning technique: the machine learns from examples whose
classes are already known and then predicts the class of new cases. The learner is told the correct
answer for every training example. The available classification methods include linear classifiers
(Logistic Regression and Naive Bayes Classifier), Nearest Neighbor, Support Vector Machines, Decision
Trees, Boosted Trees, Random Forest and Neural Networks.
Moving on to clustering, all that is needed is to identify objects that are similar in features and
structure and to group them accordingly. As a practical example, take a small child who lives with a
dog at home. They become friends and play from time to time. Eventually the dog dies. When a
neighbour's dog comes to their place, the child recognizes the animal as a dog, laughs and starts
playing with it. The child was never trained to recognize a dog, but observed the features of the
previous dog and so recognizes a new one. This is clustering: learning without being taught or
supervised. Clustering is therefore an unsupervised machine learning technique, in which the machine
learns only from the data previously supplied to it.
The types of clustering include hierarchical clustering, fuzzy clustering, density-based clustering
and model-based clustering (Kulkarni and Kulkarni 2016).
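As an illustration of the hierarchical type, a minimal single-linkage agglomerative sketch is given here. It is not the procedure used in this report, which chooses classification instead. The function name `single_linkage` and the sample points are made up for the example. No labels are supplied: groups emerge only from the distances between points.

```python
import math


def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    sample_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
    for group in single_linkage(sample_points, 2):
        print(group)
```

The pairwise search is quadratic per merge, which is acceptable for a small illustration but would be slow on a dataset of a few thousand cases.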
For our study, we choose classification, as it is more direct, more transparent and easier to
understand, and its methods are very common. The algorithm models picked are decision trees and the
Naïve Bayes model.
Starting with decision trees: a decision tree is very easy to incorporate into models in WEKA. Decision
trees come in different types and accept different types of datasets, which gives guidance on which
decision tree to apply where. The two options that can be used here are the n-fold decision tree and
the training-set decision tree. The n-fold decision tree in most cases performs very poorly, with low
accuracy and a low number of correctly classified cases, meaning that in the long run more variables
are discarded. This brings high error values, so it is not considered at all. That leaves the test set
decision tree option in WEKA, with which the data are already split into train and test sets. This
gives the highest accuracy. To some it might seem unrealistic, but in the long run it is the most
accurate. A model that gives a high level of accuracy in all cases accounts for the highest level of
variability in the points. This, therefore, is our only choice.
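The difference between an n-fold evaluation and a single train/test split can be sketched as index bookkeeping. This is only an illustration of the two schemes, not WEKA's own implementation. The function names, the seed and the default of 10 folds are assumptions made for the example, while 1936 cases and a 66% training share are the figures quoted in this report.

```python
import random


def percentage_split(n_items, train_fraction=0.66, seed=1):
    """Shuffle the case indices once and cut them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = round(n_items * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_items, k=10, seed=1):
    """Yield (train, test) index lists; every case is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    train, test = percentage_split(1936)
    print("percentage split:", len(train), "train,", len(test), "test")
    for number, (train, test) in enumerate(k_fold_indices(1936), start=1):
        print("fold", number, ":", len(train), "train,", len(test), "test")
```

With n-fold evaluation every case is scored by a model that never saw it during training, so its accuracy is usually lower than, and less optimistic than, a figure measured on data the model was trained on.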
Advantages of using a decision tree
i. Decision trees provide different ways of looking at a dataset rather than a single one,
which makes it possible to examine multiple consequences.
ii. Decision trees have a tree-like shape, with branches growing from nodes. This lets an
analyst or scientist easily see how nodes branch into further variables, which eases
classification and depiction.
iii. Transparency: each node addresses a step of the process.
iv. Simplicity: the classifications on the branches are clear and therefore easy to depict.
v. Ease of use: the graphical illustration that the decision tree provides makes it easier for
data scientists to draw inferences, as the tree branches from node to leaves.
vi. Flexibility: unlike other decision-making tools that require comprehensive quantitative data,
decision trees can handle items with a mixture of real-valued and categorical features and
items with some missing features. Once constructed, they classify new items quickly.
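The branching at each node is typically chosen with an entropy-based measure. The sketch below computes entropy and information gain for a categorical attribute. It is a simplified illustration, not the exact criterion of any particular WEKA tree learner, and the toy rows and the labels "trusted" and "untrusted" are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy removed by splitting the rows on one categorical attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Invented toy data: YEAR separates the classes, MONTH does not.
    rows = [
        {"YEAR": "2014", "MONTH": "Jan"},
        {"YEAR": "2014", "MONTH": "Feb"},
        {"YEAR": "2015", "MONTH": "Jan"},
        {"YEAR": "2015", "MONTH": "Feb"},
    ]
    labels = ["trusted", "trusted", "untrusted", "untrusted"]
    for name in ("YEAR", "MONTH"):
        print(name, information_gain(rows, labels, name))
```

A tree builder would place the attribute with the largest gain at the node and repeat the calculation on each branch.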
Disadvantages

i. Wrongly placed decisions while developing a decision tree may result in a lack of
contingencies, which can be detrimental to the analysis in the long run because the
expected results cannot be obtained. Building the tree therefore requires careful assessment
to ensure that decisions are implemented accurately.
ii. They are unstable: a small change in the data can lead to a large change in the structure of
the tree.
iii. They are relatively inaccurate, as many other algorithms, for example the random forest,
perform better on similar data.
iv. The calculations get very complicated, particularly when most data points or attributes are
closely related.
The next classifier of interest is Naïve Bayes for text analysis, since we are dealing with text in
our data. It is used as a counterpart classification model against the decision tree model that has
been selected. It therefore lets us develop a model with which to challenge the suitability of the
decision tree (Arganda-Carreras et al. 2017).
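To show how Naïve Bayes works on text, a minimal multinomial sketch with add-one smoothing follows. It is written from scratch for illustration and is not WEKA's implementation. The class name `NaiveBayesText`, the training sentences and the labels are all invented. Each word is treated as independent of the others given the class, which is the "naïve" assumption.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated, lower-cased words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Invented toy training data, not taken from the report's dataset.
    texts = [
        "doctor advice clinic",
        "hospital doctor treatment",
        "win prize now",
        "free prize click",
    ]
    labels = ["trusted", "trusted", "untrusted", "untrusted"]
    model = NaiveBayesText().fit(texts, labels)
    print(model.predict("doctor clinic"))
    print(model.predict("free prize"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.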
Advantages
i. It can handle text, a feature that fits poorly in most machine learning models.
ii. Less training data is needed to train the model. In our case, we need only 66%, against
the 70% or more needed for most models.
iii. The model converges faster, so less time is needed to run it and get predictions.
Disadvantages
i. It does not learn the relationships between and amongst the different variables.