
Machine Learning on Health Tweets Case Study 2022

   

Added on  2022-10-14

MINING 1
MACHINE LEARNING ON HEALTH TWEETS
NAME OF AUTHOR
NAME OF PROFESSOR
NAME OF CLASS
NAME OF SCHOOL
STATE AND CITY
DATE

ABSTRACT
This part gives a brief overview of what the body of the machine learning report will entail.
The analytical focus is a dataset of tweets. An introductory part tells the reader that what
follows is data analytics literature. The dataset is described in detail, stating its attributes,
the preprocessing stages and the results of that preprocessing. The main data mining techniques
are elaborated in detail as well. A machine learning technique is chosen, together with the
reason why it is being applied. The report states whether the machine learning algorithm is a
classification or a clustering algorithm, which can only be answered once the problem posed by
the dataset has been identified as a classification or a clustering problem. In the evaluation
and demonstration section, the developed model is compared with another, and the reason for
preferring it over the other is stated clearly. The conclusion section wraps up the whole report.
INTRODUCTION
A dataset was provided that contains descriptions of what is actually good for health and what
is not. It is essentially health advice given through tweets. These were actual tweets, each
sent with a URL link. The response variable here is the description variable, and this is what
the classification models will be built on (Ryu and Moon, 2016). The classification will
therefore be a text mining classification. Two classification methods will be compared with one
another. Comparing the two on the same dataset will aid the evaluation and demonstration
section, which points out which one is the better classification model (Zhou, Tong, Gu and
Gall, 2016).

Machine learning, in the 21st century, is widely used for classification and regression, and
this aids in understanding different datasets and the trends and patterns they display. To give
an off-topic but business-related example, consider checking which customers would churn and
which would not in the long run from the services offered by a telecommunication company
(Zainudin, Shamsuddin and Hasan, 2019). In this case, the variable of interest would be the
churn variable or feature column, as this is the column that records whether or not a customer
stayed loyal. When a classifier is run on this data, for example logistic regression, which is
one of the simplest classification algorithms to understand, customers are classified into those
likely to churn and those likely to stay with the company. After the classification, the result
can be tested using the confusion matrix, to see which customers were classified correctly and
which were classified wrongly. This helps the business know where to put resources, as it is
easier and cheaper to retain already acquired customers than to get new ones (Balcan, Sandholm
and Vitercik, 2018).
From the illustration in the paragraph above, it is clear that machine learning, and
classification in particular, aids the operations of a company and hence its profits.
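The confusion-matrix check described in the churn illustration can be sketched in a few lines of Python. This is only a minimal sketch: the labels "churn" and "stay", the function name confusion_matrix and the six example customers are all hypothetical and are not taken from any dataset used in this report.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="churn"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted churn: correct only if the customer really churned.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted stay: a miss if the customer actually churned.
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels, invented for illustration only.
actual = ["churn", "stay", "churn", "stay", "stay", "churn"]
predicted = ["churn", "stay", "stay", "stay", "churn", "churn"]

matrix = confusion_matrix(actual, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix)
print(f"accuracy = {accuracy:.2f}")
```

The FN cell counts the customers who churned but were predicted to stay, which is the group a retention budget would miss.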
Machine learning is, however, used widely across industries, and in our case we will be
focusing on health data: tweeted information on what is and what is not recommended for
staying healthy. This indicates that machine learning can be used in health in different ways.

The aim of this assignment is to demonstrate one of the many ways in which machine learning,
and more specifically the WEKA analytics tool, can aid in understanding health data.
DATA SUMMARY AND PREPROCESSING
The dataset to be worked on with the machine learning algorithms chosen in the next section of
the report was downloaded from the UCI Machine Learning Repository. The dataset came as a zip
archive containing 16 text files. All 16 text files are health-related datasets, and the main
idea in all of them is the responses and opinions given on the health routes that most people
take: what they should avoid and what they should pick in order to stay healthy (Keleş and
Keleş, 2018).
The dataset chosen from the 16 is foxnewshealth.txt, which had to be transformed into a CSV
file for easier upload into WEKA. The transformation created several variables: ID, MONTH,
DATE, DAY, TIME, YEAR, DESCRIPTION and finally URL. Since this is a dataset of tweets, there is
a URL link for every tweet sent. Because the dataset is a health dataset, it was assumed that
those tweeting were from the medical field.
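The text-to-CSV transformation could be sketched as follows. The report does not say how the split was actually done, so this is a sketch under stated assumptions: each line of the text file is taken to have the form id|timestamp|tweet text, with a Twitter-style timestamp such as "Wed Apr 08 14:05:00 +0000 2015". The function name parse_tweet_line and the sample line are hypothetical.

```python
import csv
import io
import re

FIELDS = ["ID", "MONTH", "DATE", "DAY", "TIME", "YEAR", "DESCRIPTION", "URL"]
URL_PATTERN = re.compile(r"https?://\S+")


def parse_tweet_line(line):
    """Split one 'id|timestamp|text' line into the eight report attributes."""
    # Assumed layout: only the first two pipes separate fields.
    tweet_id, timestamp, text = line.rstrip("\n").split("|", 2)
    # Assumed timestamp layout: day month date time utc-offset year.
    day, month, date, time, _offset, year = timestamp.split()
    urls = URL_PATTERN.findall(text)
    description = URL_PATTERN.sub("", text).strip()
    return {
        "ID": tweet_id, "MONTH": month, "DATE": date, "DAY": day,
        "TIME": time, "YEAR": year,
        "DESCRIPTION": description, "URL": " ".join(urls),
    }


# Hypothetical sample line, invented for illustration only.
sample = ("1234567890|Wed Apr 08 14:05:00 +0000 2015|"
          "Five tips for a healthier heart http://example.com/abc")
row = parse_tweet_line(sample)

# Write the parsed row as CSV; the csv module handles quoting of commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Lines that do not match the assumed layout raise a ValueError when unpacked, which gives a simple way to find rows that need cleaning before the file is loaded into WEKA.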
After the CSV transformation, the dataset had to be loaded into WEKA, but this often failed
until a thorough clean-up had been done to obtain a CSV file that WEKA could load. The clean-up
involved deleting multiple rows that were in a format WEKA could not read. After the dataset
was loaded into WEKA, the CSV file was transformed into an ARFF file, which is the WEKA data
file type. The ARFF file was then saved and loaded afresh into WEKA (Alcala-Fdez et al. 2016).
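To show what the ARFF file type looks like, here is a minimal sketch that writes a few rows in ARFF layout: a @relation line, one @attribute line per column, then @data. In the report the conversion was done inside WEKA itself, so this code is not the method used. The attribute types chosen here, the function name to_arff and the example row are assumptions for illustration.

```python
def quote(value):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_arff(relation, attributes, rows):
    """Render rows as ARFF text; attributes is a list of (name, type) pairs."""
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        cells = []
        for (_name, arff_type), value in zip(attributes, row):
            if arff_type == "string":
                cells.append(quote(str(value)))  # free text must be quoted
            else:
                cells.append(str(value))  # numeric and nominal values stay bare
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Assumed types: numeric ID, nominal DAY, free-text DESCRIPTION.
attributes = [
    ("ID", "numeric"),
    ("DAY", "{Mon,Tue,Wed,Thu,Fri,Sat,Sun}"),
    ("DESCRIPTION", "string"),
]
# Hypothetical row, invented for illustration only.
rows = [(1, "Wed", "Five tips for a healthier heart")]

arff_text = to_arff("health_tweets", attributes, rows)
print(arff_text)
```

Declaring DESCRIPTION as a string attribute keeps the tweet text intact, so that a text filter can later be applied to it before classification.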
The Explorer tab was used for everything: exploration, preprocessing, machine learning, and
examination of data size and data quality. After the clean-up, the dataset was left with 1936
cases and eight attributes. The ID attribute has 48 distinct values. The remaining statistical
description is shown below:
Figure 1

Figure 1 lists the mean, median, maximum and standard deviation. There are also clear
histograms for the different classes.
Figure 2
From Figure 2, it is evident which days contributed the most tweets to the dataset. Wednesday
has the highest frequency, followed by Tuesday, Friday, Thursday, Monday, Saturday and finally
Sunday. Fewer tweets were recorded on Saturday and Sunday, and this might be explained by the
fact that these are days off work and people
