SIT717 Assignment 3: WEKA Data Analysis of Health Response Tweets

Added on  2022/10/17

This presentation analyzes health response tweets using WEKA software for business intelligence purposes. The analysis begins with an introduction to the topic, followed by a data summary that discusses data types, structure, and preprocessing, including the conversion of data from text to CSV and then ARFF formats. The presentation then explores data mining techniques, specifically classification algorithms like decision trees and Naive Bayes, evaluating their performance based on accuracy, error rates, and time taken to build the models. The decision tree model is found to be superior, achieving a high accuracy rate of 98%. The presentation concludes with an evaluation section that highlights the confidence level derived from the comparison of both models, and a recommendation for the decision tree model due to its better performance and the ease of use offered by WEKA. References to relevant research papers are also included.
HEALTH RESPONSE
TWEETS ANALYSIS FOR
BUSINESS INTELLIGENCE
WEKA DATA ANALYSIS PRESENTATION
EXECUTIVE SUMMARY
The report is organized into sections of discussion. The sections are:
INTRODUCTION
DATA SUMMARY
DATA MINING TECHNIQUES
EVALUATION AND DEMONSTRATION
In other words, the executive summary previews what the full report contains. Each section can be explained
briefly here, but only briefly, because the word count for this part is small.
INTRODUCTION
The analysis is carried out in the WEKA software.
Machine learning is the basis of the study, since the analysis is performed on tweets.
The topic is the analysis of health tweets. The responses that patients and medical practitioners give
are analyzed to identify the classification groups they fall under. The brainstorming of the topic
is covered in this section of the report.
The dataset extraction and the discussion of the analytical software are also covered at this stage.
The responses given per tweet could also be used commercially: by analyzing the types of
complaints raised in tweets, the relevant stakeholders can act to offer complainants what they
need or to rectify a situation.
DATA SUMMARY
This section covers the data types, the data structure and the preprocessing. The analysis
software is also discussed in detail.
The data had to be converted from a text format into a CSV file for upload into WEKA.
New variables had to be derived from the original dataset to make the upload into WEKA
easier. The dates were split further, and the Tweet ID was split into description and URL,
which gives the dataset more variables. Splitting the fields into more variables reduces the
formatting problems that can prevent a dataset from loading into WEKA.
After the dataset is uploaded into WEKA in CSV format, it is advisable to convert it into
ARFF format, which is WEKA's own dataset format. The converted file is saved and then
reloaded into WEKA.
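
The conversion steps above can be sketched in code. This is a minimal illustration, not the procedure used in the presentation: it assumes each raw record has the hypothetical layout `id|timestamp|text`, that the timestamp has six space-separated parts, and that a URL, when present, is the last token of the text. All field names are invented for the example, and every attribute is written as an ARFF `string` for simplicity.

```python
import csv
import io

# Hypothetical column names; the presentation does not list the real ones.
FIELDS = ["id", "weekday", "month", "day", "time", "year", "description", "url"]

def parse_tweet_line(line):
    """Split one assumed 'id|timestamp|text' record into separate variables."""
    tweet_id, timestamp, text = line.rstrip("\n").split("|", 2)
    # Assumed timestamp layout: 'Thu Apr 09 01:31:50 +0000 2015'.
    # A different layout raises ValueError here.
    weekday, month, day, time, _zone, year = timestamp.split()
    # Treat the last token as the URL only if it looks like one.
    description, _, last = text.rpartition(" ")
    if last.startswith("http"):
        url = last
    else:
        description, url = text, ""
    return dict(zip(FIELDS, [tweet_id, weekday, month, day, time, year,
                             description, url]))

def to_csv(rows):
    """Serialize parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_arff(rows, relation="health_tweets"):
    """Serialize parsed rows as ARFF text, with every attribute typed as string."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s string" % name for name in FIELDS]
    lines += ["", "@data"]
    for row in rows:
        quoted = []
        for name in FIELDS:
            # Escape backslashes and single quotes inside quoted ARFF values.
            value = row[name].replace("\\", "\\\\").replace("'", "\\'")
            quoted.append("'%s'" % value)
        lines.append(",".join(quoted))
    return "\n".join(lines) + "\n"

# Made-up record for illustration only.
example = parse_tweet_line(
    "123|Thu Apr 09 01:31:50 +0000 2015|Flu shots urged http://t.co/abc")
print(to_csv([example]))
print(to_arff([example]))
```

WEKA can perform the CSV to ARFF conversion itself when the file is saved from the Explorer; the sketch only shows what the two file layouts look like.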
DATA SUMMARY
When the ARFF file is loaded, WEKA automatically displays a statistical summary.
Some variables have neither descriptive statistics nor plots, because they contain too
many data points to be analyzed.
The variables with no basic plot during preprocessing are the URL and the Description variables.
The other variables show either the mean, median, maximum and minimum values, or counts if
they are logical values.
The variables that return such statistics are accompanied by plots that
show the distribution of each variable (YE, LI, ADJEROH AND IYENGAR,
2017).
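
The kind of per-variable summary described above can be approximated as follows. This is a rough sketch and not WEKA's own implementation: it assumes values arrive as strings, reports minimum, maximum, mean and median when every value parses as a number, and otherwise falls back to counts per distinct value. The function name `summarize` is made up for the example.

```python
from collections import Counter
from statistics import mean, median

def summarize(values):
    """Numeric statistics if all values are numbers, else counts per value."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one value is not numeric: treat the variable as categorical.
        return {"type": "nominal", "counts": dict(Counter(values))}
    return {
        "type": "numeric",
        "min": min(numbers),
        "max": max(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
    }

# Made-up values for illustration only.
print(summarize(["1", "2", "3", "6"]))
print(summarize(["Mon", "Tue", "Mon"]))
```

A variable such as URL, where almost every value is distinct, would produce one count per row, which is consistent with no useful plot being available for it.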
DATA SUMMARY
[Figure: WEKA plots showing the distribution of each variable; image not reproduced]
DATA SUMMARY
The plot on the previous slide shows how the distributions are presented.
URL, time and description do not have plots because they contain too
many data points for a plot to be produced.
Variables such as ID, Date and Month show the distribution of the number
of times an ID, a date or a month occurs in the collected tweet data.
The dates, months and days with higher frequencies indicate that
there are times when individuals tweet a lot.
The less frequent days have less tweet activity; people might be off work or
on holiday during those days and therefore rarely touch their phones.
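
The frequency reading above amounts to counting tweets per value of a date field. A minimal sketch, assuming the rows have already been parsed into dictionaries with hypothetical `weekday` and `month` keys; the sample rows are invented and do not come from the dataset.

```python
from collections import Counter

def count_by(rows, field):
    """Count how many tweets share each value of the given field."""
    return Counter(row[field] for row in rows)

# Made-up rows for illustration only.
sample_rows = [
    {"weekday": "Mon", "month": "Apr"},
    {"weekday": "Mon", "month": "Apr"},
    {"weekday": "Sat", "month": "Apr"},
    {"weekday": "Tue", "month": "May"},
    {"weekday": "Mon", "month": "May"},
]

weekday_counts = count_by(sample_rows, "weekday")
busiest_day, busiest_count = weekday_counts.most_common(1)[0]
print(busiest_day, busiest_count)
```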
DATA MINING TECHNIQUE
As before, the data is analyzed in the WEKA software. All the insights are derived from
the processes run here.
The analysis technique is classification, which offers several algorithms.
The classification algorithms used are decision trees and Naïve Bayes.
Decision trees have the advantage of being easy to handle during
analysis, but the disadvantage that the whole structure can change markedly when a
small entry in the dataset changes.
Naïve Bayes has the advantage of handling text data easily
and the disadvantage of performing worse than models such as random forest.
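
To illustrate why Naïve Bayes suits text data, here is a small multinomial Naïve Bayes classifier with add-one (Laplace) smoothing. It is a teaching sketch under stated assumptions, not WEKA's NaiveBayes implementation: tokens are lowercase whitespace-separated words, and the labels `info` and `complaint` and the training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                # Add-one smoothing avoids a zero probability for unseen pairs.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data for illustration only.
model = NaiveBayesText().fit(
    ["flu vaccine clinic open", "hospital wait times too long",
     "new vaccine available today", "long wait at emergency room"],
    ["info", "complaint", "info", "complaint"],
)
print(model.predict("vaccine clinic today"))
print(model.predict("long wait"))
```

Each word contributes independently to the class score, which is what keeps the method simple for text and also what limits it compared with models that capture interactions between features.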
DATA MINING TECHNIQUE (DECISION TREE)
The decision tree is a supervised machine learning algorithm.
A training set is used to evaluate the model, which means n-fold cross-validation is not
used.
N-fold cross-validation, by contrast, does not discard many instances when building the
model and takes little time to build it.
The split of the dataset will be as per WEKA’s default settings.
The target variable is the URL variable, given the arrangement of the data and the fact that
the analysis is performed on a variable of text entries (SHAHIRI AND HUSAIN, 2015).
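
For context, the n-fold option mentioned above partitions the instances so that each one is used for testing exactly once, whereas evaluating on the training set tests on the same instances the model was built from. A minimal sketch of the n-fold partition, assuming a simple interleaved assignment of instances to folds; WEKA's own fold assignment may differ, and the function name is made up for the example.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_instances))
    for fold in range(k):
        # Every k-th instance, starting at position `fold`, forms the test fold.
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Ten instances split into five folds, for illustration only.
for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Because the test instances are held out of training in every fold, n-fold accuracy is usually a more conservative estimate than accuracy measured on the training set itself.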
DATA MINING TECHNIQUE (DECISION TREE)
We start by looking at the relevant results that have been developed.
DATA MINING TECHNIQUE (DECISION TREE)
The results show that most instances were classified correctly:
1824 correctly classified against 21 that were
misclassified.
The time taken to build the model, as shown above, is low, so the
model is quick to develop.
The accuracy of the model stands at 98%, which is very high
compared to other models.
The error rates are low, which is consistent with the high accuracy (FENG
AND ZHU, 2016).
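
The accuracy figure follows directly from the two counts reported on this slide. The short check below uses only those counts (1824 correct, 21 incorrect); the function name is made up for the example.

```python
def accuracy_and_error(correct, incorrect):
    """Return (accuracy, error rate) as fractions of all classified instances."""
    total = correct + incorrect
    return correct / total, incorrect / total

accuracy, error = accuracy_and_error(1824, 21)
print(f"accuracy = {accuracy:.2%}, error = {error:.2%}")
# prints "accuracy = 98.86%, error = 1.14%"
```

This is the proportion of correctly classified instances out of 1845 in total, which the slide reports as 98%.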
DATA MINING TECHNIQUE (NAÏVE BAYES)
This serves as the counter model in the development of the algorithms.
Its performance needs to be compared with the
performance of the decision tree algorithm that has been developed so far.