Data Mining on Twitter Data using Machine Learning Algorithms

Added on 2022-10-17

Data mining on Twitter data using machine learning algorithms in R
Executive summary
Twitter has been popular ever since its launch. Over the years that popularity has attracted a
large number of subscribers and brought many different groups of people onto the platform. One
result is that spam tweets now appear alongside the reliable tweets posted on the forum.
Advertisement spam on Twitter arises from tweets that carry wrong information to consumers.
Most social communication and mail platforms try as far as possible to limit misleading
advertisements and mails, and many have developed clear systems for categorizing emails as
misleading or not misleading. These systems screen the content and, after the assessment,
decide whether it is misleading. Where content is found to have misled consumers, the decision
taken depends on the forum and on the content's influence on consumers: it may be erased from
the system or moved to the misleading section of the system. This study therefore uses train
and test datasets to classify Twitter data. The response variable, tweet class, has two groups:
the spammers (misleading adverts) and the non-spammers. It is the variable referred to
throughout the R programming, which is used to develop five machine learning algorithms.
Key words: Data set, Twitter, algorithms, R

Introduction
Machine learning is important in the daily operations of companies that have adopted it. It has
been adopted in sectors ranging from manufacturing and agriculture to sales and health care.
Machine learning and artificial intelligence have affected people's lives and increased profit
margins. Machine learning is largely used for classification and regression, and for anything
that involves future predictions (Cordon et al., 2018). Cases arising from misleading tweets
will be addressed here using classifiers in R.
A train dataset and two test datasets have been provided for classification. Both theoretically
and practically, one expects to find a number of misleading tweets. Their total is always lower
than that of the accurate tweets, so unseen data passed through the trained model is expected
to show the same imbalance. This is because the number of individuals who tweet misleading
tweets is lower than the number of individuals who tweet relevant tweets (Kumari, Vidya &
Karitha, 2019). Five models are to be developed. Of these models, some are supervised and
others unsupervised. The supervised learning algorithms are decision trees, random forest,
logistic regression and naïve Bayes. The only unsupervised algorithm applied is K-means
clustering (Arora et al., 2018).
Literature review
The machine learning process will depend on classification, which will be translated into
clusters. Classified data are the only groups to be clustered. In supervised learning, a system
is told what to do by being trained on labelled data, while in unsupervised learning the
reverse occurs: the system connects the dots on its own from the data it is fed and makes its
own judgment from that data. This eases categorization, because one develops a reliable model
and lets it run to make future predictions on a constant amount of test data. The only time one
needs to change parts of the developed model is when a new train dataset is provided. Model
development requires both a train and a test dataset. These can be obtained by splitting the
main dataset in two (Bowers, Alex & Xiaoliang Zhou, 2019).
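The split described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name train_test_split, the 25% test fraction, the seed and the toy tweets list are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                     # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (tweet_id, tweet_class) pairs, not the study's dataset.
tweets = [(i, "spammer" if i % 5 == 0 else "non-spammer") for i in range(20)]
train, test = train_test_split(tweets, test_fraction=0.25)
print(len(train), len(test))
```

Shuffling before splitting keeps both parts representative of the whole dataset, and fixing the seed makes the split reproducible between runs.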
Of the performance metrics, the confusion matrix and the ROC curve with its AUC will be used to
assess performance on the test data. The remaining two performance metrics are the percentage
accuracy on test data 1 compared with test data 2, and the error rates that the test datasets
produce when run through the respective developed models (Bowers, Alex & Xiaoliang Zhou, 2019).
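To make these metrics concrete, here is a small Python sketch (not the study's R code) that computes a confusion matrix, accuracy, error rate and AUC. The function names, the 0.5 cut-off and the toy labels and scores are assumptions for illustration. AUC is computed here as the fraction of (spammer, non-spammer) pairs that the scores rank correctly, which equals the area under the ROC curve.

```python
def confusion_matrix(actual, predicted, positive="spammer"):
    """Return counts (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def auc(actual, scores, positive="spammer"):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, for illustration only.
actual = ["spammer", "non-spammer", "spammer", "non-spammer", "non-spammer"]
scores = [0.9, 0.2, 0.4, 0.6, 0.1]
predicted = ["spammer" if s >= 0.5 else "non-spammer" for s in scores]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
acc = accuracy(tp, fp, fn, tn)
error_rate = 1.0 - acc
area = auc(actual, scores)
print((tp, fp, fn, tn), acc, error_rate, area)
```

Running the same functions on each of the two test datasets gives directly comparable accuracy and error-rate figures.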
Supervised and unsupervised machine learning algorithms and their importance
The first classification model is logistic regression, a statistical model for a binomial
outcome that relates the response variable to the predictor variables identified. In the
process, probabilities are estimated. Logistic regression serves as a classification algorithm
in R because it is predictive: given the values of the predictor variables, the response
variable is classified as one class or the other, and this can be run on test data. Once the
test data have been used for prediction with the already developed model, the performance of
the model itself can be tested using a confusion matrix, ROC curve and AUC (Gelman et al. 2019).
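The idea of estimating a probability and then assigning a class can be sketched in Python as below. This is an illustration only: it fits the model by plain batch gradient descent, which is not necessarily how the R routines used in the study fit it, and the single feature (a made-up count of links per tweet), the learning rate and the number of epochs are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """Estimated probability of class 1 for features x; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights [intercept, w1, ..., wk] by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi   # gradient of the log-loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Hypothetical training data: one feature per tweet, 1 = spammer, 0 = non-spammer.
X = [[0], [1], [0], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)

# Classify by thresholding the estimated probability at 0.5.
label_low = "spammer" if predict_proba(w, [0]) >= 0.5 else "non-spammer"
label_high = "spammer" if predict_proba(w, [5]) >= 0.5 else "non-spammer"
print(label_low, label_high)
```

The probabilities returned by predict_proba are the scores that a ROC curve and AUC would be computed from; the 0.5 threshold is only needed to produce the hard labels for a confusion matrix.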
The next classification model is the decision tree. This model starts from nodes that split
into branches, which divide into further branches until they reach leaves that give the
classification of the response variable in question. The performance of this model is tested
using a cross-validation plot. If the plot is not promising, the tree can be pruned to give
clearer classifications. Pruning and predicting then serve as the performance test
(Chen et al. 2018).
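How a single node chooses its branches can be illustrated with the short Python sketch below. It assumes Gini impurity as the splitting criterion, which the text does not specify, and it reuses a made-up one-feature toy dataset; a full tree would apply the same search recursively to each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    The threshold is None when no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2                     # candidate threshold between neighbours
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical feature values and classes, for illustration only.
values = [0, 1, 0, 3, 4, 5]
labels = ["non-spammer"] * 3 + ["spammer"] * 3
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```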
The third model will be naïve Bayes. Its evaluation metrics will be a plot of the model together
with the confusion matrix obtained from the model's results (Slamet et al. 2018).
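As an illustration of how naïve Bayes assigns a class, the Python sketch below trains a multinomial naïve Bayes model on word counts with add-one smoothing. The choice of word counts as features, the smoothing, the function names and the four toy tweets are assumptions for the example; they are not taken from the study's R code or data.

```python
import math
from collections import Counter

def fit_naive_bayes(docs, labels):
    """Count documents per class and words per class; also collect the vocabulary."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for text, c in zip(docs, labels):
        word_counts[c].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)          # log prior
        denom = sum(word_counts[c].values()) + len(vocab)       # add-one smoothing
        for w in text.lower().split():
            if w in vocab:                                      # skip unseen words
                score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training tweets, for illustration only.
docs = ["win free prize now", "free money click here",
        "meeting at noon today", "lunch with team today"]
labels = ["spammer", "spammer", "non-spammer", "non-spammer"]
model = fit_naive_bayes(docs, labels)

first = predict(model, "free prize click")
second = predict(model, "team meeting today")
print(first, second)
```

The predicted labels from such a model are what the confusion matrix mentioned above would be built from.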
The fourth model to be considered will be the random forest. Here classification is done by
several decision trees, built using the random forest library available in R. The evaluation
metrics will be the confusion matrix and the random forest plot, which shows the error rate,
the point beyond which the model cannot be improved further, and the total number of trees at
which that point is reached (Subudhi et al. 2019).
Development of classification algorithms
The datasets provided were text files, which were loaded and viewed in R. Each dataset loads as
a data frame with a single column. Datasets in that form are only used directly when doing text
mining or natural language processing. In our case, the text files must be converted into CSV
files. After conversion, the variable (attribute) names are assigned according to the JSON
format provided in the description of how the dataset should be laid out.
From here we go deeper into the classification models, focusing first on logistic regression.
The first step is to load all the datasets: one train dataset and two test datasets. The models
will be developed using the provided train data and then tested using the provided test
datasets. Both test datasets must be used to make predictions, because the performance on the
two datasets needs to be compared to help determine which test dataset is the best
