Location Mining from Tweets: A Review of POS Tagging Algorithms

Added on 2023/04/20

POS Tagging Algorithm for Location Mining from Tweets
Abstract—Social media platforms contain a wealth of
information that gives people the opportunity to
explore hidden patterns and unknown correlations. This
paper presents a critical review of five articles on POS
tagging algorithms for location mining from tweets.
The articles were chosen because they contain relevant
information and are related to one another in their
content and in the way each presents its algorithm.
The paper also presents an analysis based on the
literature review of these articles.
I. INTRODUCTION
Today, there are many social media platforms that
enable their users to connect with others and share many
kinds of information. Social media data contains
important information and gives users the
opportunity to explore hidden patterns and unknown
correlations, and to understand how satisfied people
are with the many different topics they discuss.
While much of this social media data
exists in the public domain, it remains
challenging to analyse the data and mine useful
information from it.
II. CRITICAL REVIEW OF THE ARTICLES
This work is inspired by and related to several lines
of research. In this section, we summarize and
briefly discuss them as follows.
Twitter Data Analysis Tools. There are a few tools
available for analysing social networking data for
different application scenarios.
Keyhole offers a large number of packaged
analytics visualisations that present metrics in
easy-to-read graphs and layouts for keywords,
account summaries and so on. It provides a variety of
dashboards that display results for the hashtag the
user enters as the search key.
Tweet Sentiment Visualization is an analytics
application developed to study ways of visualizing
sentiment in unstructured and non-grammatical
tweets. It offers a comprehensive suite
of sentiment visualization techniques that use
searched keywords to analyze the sentiment behind
each tweet associated with those keywords.
Twitter Analytics is another analytics application,
developed by Twitter itself. It has two main tools.
One is the Tweet activity dashboard, which allows users
to learn more about their Tweets and understand
their audience. The other is an enhanced analytics
tool known as audience insights, which provides a more
detailed breakdown of a user's followers to help
advertisers better plan their advertising strategies.
Data Crawlers. Several crawling systems
have been used in the past few years to
support Twitter research. Song et al. [1] explored
topological and geographical properties of
Twitter. Using REST API methods, they extracted
tweets from April 1st to May 30th, 2007 and
obtained around 1.3 million tweets from 76k users.
Because Twitter had only just launched when the
authors crawled it, the amount of data that could be
collected was limited. Several other
researchers crawled tweets from Twitter to
investigate sentiment analysis [2], to develop spam
detection systems that identify suspicious users [3], and
to detect critical events promptly [4]. Most of this
research produced systems that focus on a specific
kind of data.
Sentiment Analysis. Sentiment analysis is a growing
area of Natural Language Processing. It can be
handled at many levels of granularity, from
document-level classification [5] to
sentence-level classification [6] and, more recently,
phrase-level classification [7].
III. ANALYSIS OF ALGORITHM
Rapid Automatic Keyword Extraction (RAKE)-based
Algorithm. The second topic extraction
algorithm we developed is based on RAKE. RAKE is a
well-known algorithm, implemented in Python, for
extracting keywords from text [18]. It is both
category-independent and language-independent.
The algorithm works by extracting candidate phrases
made up of non-stopwords
and then scoring these phrases across the text.
Unlike other algorithms, it does not remove
punctuation marks but instead treats them as sentence
boundaries. It also uses a stopword list, in which
the stopwords are treated as phrase boundaries to
help generate keywords or phrases that consist of
one or more non-stopwords. After that, the
algorithm computes a score for each
extracted keyword or phrase by summing the
scores of its words. Words are scored according to
their frequency and the length of the phrases in
which they appear. We developed a Java version of
RAKE for the system. Figure 1 shows the result of
processing tweet example 1 with the RAKE
algorithm.
Figure 1. Sample output of RAKE algorithm.
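The RAKE steps described above can be illustrated with a minimal Python sketch. It is not the Java implementation used in the system. It assumes the commonly published RAKE word score, degree divided by frequency, where a word's degree is the total length of the phrases in which it occurs. The tiny stopword list, the punctuation set and the function names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation marks are kept as sentence boundaries, not discarded.
    for fragment in re.split(r'[.!?,;:()"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9#@'_-]+", fragment):
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # longer phrases raise the degree
    return {word: degree[word] / freq[word] for word in freq}


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    scores = word_scores(phrases)
    ranked = {" ".join(p): sum(scores[w] for w in p) for p in set(phrases)}
    return sorted(ranked.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for phrase, score in rake("Heavy traffic on the Golden Gate Bridge in San Francisco."):
        print(f"{score:4.1f}  {phrase}")
```

In this made-up tweet, the stopwords "on", "the" and "in" split the text into three candidate phrases, and the three-word place name receives the highest score because each of its words occurs only in a long phrase.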
GATE Twitter Part-Of-Speech Tagger-based
Algorithm. The third program is based on
the GATE Twitter Part-of-Speech (POS) Tagger, which
is a state-of-the-art tagger for tweet data. The
tagger aims to achieve competitive accuracy and
uses the Penn Treebank tag set so
that it can be integrated into other tools
seamlessly. The tagger is an adapted and
augmented version of a leading Conditional Random
Field (CRF)-based tagger, customised for English
tweets [20]. Its authors report that the tagger
achieved 91% token accuracy on their evaluation set,
which is considered very high for tweets. Most importantly,
it has relatively high whole-sentence accuracy, that
is, the proportion of sentences in which every token
is tagged correctly. For tasks like dependency parsing
and event extraction, it is crucial to tag the whole
sentence correctly.
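The difference between token accuracy and whole-sentence accuracy can be made concrete with a short sketch. The function name and the example tags below are illustrative assumptions and are not taken from the tagger's evaluation.

```python
def tagging_accuracy(gold, predicted):
    """Return (token_accuracy, sentence_accuracy).

    `gold` and `predicted` are parallel, non-empty lists of sentences,
    where each sentence is a list of POS tags.
    """
    total_tokens = 0
    correct_tokens = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence needs one predicted tag per gold tag")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        # A sentence only counts if every one of its tokens is correct.
        if matches == len(gold_tags):
            correct_sentences += 1
    return correct_tokens / total_tokens, correct_sentences / len(gold)


if __name__ == "__main__":
    gold = [["NNP", "VBZ", "JJ"], ["RB", "JJ", "NN", "NN"]]
    pred = [["NNP", "VBZ", "JJ"], ["RB", "JJ", "NN", "JJ"]]
    token_acc, sentence_acc = tagging_accuracy(gold, pred)
    print(f"token accuracy: {token_acc:.3f}, sentence accuracy: {sentence_acc:.3f}")
```

A single wrong tag leaves token accuracy high but invalidates the whole sentence. As a rough illustration, if tagging errors were independent (a simplifying assumption), a tagger with 91% token accuracy would tag every token of a 15-token tweet correctly only about 24% of the time, since 0.91 to the power of 15 is roughly 0.24.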
Figure 2. Procedure of Integrating GATE Twitter POS
Tagger for topic modeling
Figure 2 above illustrates how the GATE
Twitter POS tagger is integrated with topic modelling
to obtain a list of topics or keywords. Our topic
modelling algorithm works as follows. It takes in
tweets and passes them to a tagging function that tags
every RT, username, hashtag and URL. The
output is then passed to maxentTagger with the GATE
Twitter POS tagger model, which performs part-of-speech
tagging on the rest of the words. The output of the
GATE Twitter POS tagger is then
tokenized to select words that are
tagged with JJ (adjective), NN (noun, singular), NNS
(noun, plural) or NNP (proper noun), and the position
of each selected word is recorded. If the next word is
also tagged as NN, NNS or NNP, the algorithm continues
in the same way until it reaches a word that is not
tagged as JJ, NN, NNS or NNP, and then saves the
collected run as an extracted word or phrase.
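The pre-tagging and phrase-selection steps can be sketched in Python as follows. The sketch does not call the GATE tagger: it assumes that tweets have already been split into tokens and, for the selection step, that POS tags are already available. It implements one reading of the procedure, in which every maximal run of JJ, NN, NNS or NNP tokens becomes a phrase. The function names, the regular expressions and the labels RT, USR, HT and URL are assumptions made for illustration.

```python
import re

# Tags kept as phrase material: adjective plus the three noun tags.
PHRASE_TAGS = {"JJ", "NN", "NNS", "NNP"}

# Hypothetical patterns for Twitter-specific tokens, checked in order.
SPECIAL_PATTERNS = [
    ("RT", re.compile(r"^RT$")),
    ("USR", re.compile(r"^@\w+$")),
    ("HT", re.compile(r"^#\w+$")),
    ("URL", re.compile(r"^https?://\S+$")),
]


def pretag(tokens):
    """Label Twitter-specific tokens; others get None and go to the POS tagger."""
    tagged = []
    for token in tokens:
        label = next(
            (name for name, pattern in SPECIAL_PATTERNS if pattern.match(token)),
            None,
        )
        tagged.append((token, label))
    return tagged


def extract_phrases(tagged_tokens):
    """Collect maximal runs of JJ/NN/NNS/NNP tokens as (start_position, phrase)."""
    phrases = []
    run = []
    start = None
    for position, (token, tag) in enumerate(tagged_tokens):
        if tag in PHRASE_TAGS:
            if not run:
                start = position  # remember where the phrase begins
            run.append(token)
        elif run:
            phrases.append((start, " ".join(run)))
            run = []
    if run:  # flush a phrase that ends the tweet
        phrases.append((start, " ".join(run)))
    return phrases


if __name__ == "__main__":
    tagged = [
        ("RT", "RT"), ("@cityalerts", "USR"), ("Heavy", "JJ"),
        ("traffic", "NN"), ("near", "IN"), ("Golden", "NNP"),
        ("Gate", "NNP"), ("Bridge", "NNP"), ("#SF", "HT"),
    ]
    print(extract_phrases(tagged))
```

Keeping the start position of each run makes it possible to map an extracted phrase such as a place name back to its location in the original tweet.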
IV. CONCLUSION
While social media platforms contain a great wealth
of information, it remains challenging to analyse
it for different application purposes. In this paper,
we analysed tweets to automatically detect
topics, sentiments and correlated topics over time,
and employed different visualizations after
information integration. In addition, we evaluated
and compared multiple algorithms for each analysis
component to inform different design decisions and
choices.
REFERENCES
[1] B. O’Connor, R. Balasubramanyan, B. R. Routledge, and N. A.
Smith, “From tweets to polls: Linking text sentiment to public
opinion time series,” 01 2010.
[2] S. Asur and B. A. Huberman, “Predicting the future with social
media,” CoRR, vol. abs/1003.5699, 2010. [Online]. Available:
http://arxiv.org/abs/1003.5699
[3] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede,
“Lexicon-based methods for sentiment analysis,”
Computational Linguistics, vol. 37, no. 2, pp. 267–307, 2011.
[4] D. Jurafsky and J. H. Martin, Speech and language processing:
an introduction to natural language processing, computational
linguistics, and speech recognition. Dorling Kindersley Pvt, Ltd.,
2014.
[5] H. S. Paskov, “Learning with n-grams: From massive scales to
compressed representations,” Mar 2017.
[6] P. B. Awachate and V. P. Kshirsagar, “Improved twitter
sentiment analysis using n gram feature selection and
combinations,” International Journal of Advanced Research in
Computer and Communication Engineering, vol. 5, no. 9, Sep
2016.
[7] L. Polanyi and A. Zaenen, “Contextual valence shifters,” in
Computing attitude and affect in text: Theory and applications.
Springer, 2006, pp. 1–10.

[8] “Twitter sentiment analysis scoring by sentence, learn data
science,” 2018. [Online]. Available:
https://blog.exploratory.io/twitter-sentimentanalysis-scoring-by-sentence-b4d455de3560
[9] A. I. Baqapuri, “Twitter sentiment analysis,” arXiv preprint
arXiv:1509.04219, 2015.
[10] C. Matyszczyk, “Trump’s tweets: Android for nasty, iPhone for
nice,” CNET, 2016.
[11] “PowerTrack API,” 2018. [Online]. Available:
https://developer.twitter.com/en/docs/tweets/filterrealtime/overview/powertrack-api.html
[12] C. Johnson, P. Shukla, and S. Shukla, “On classifying the political
sentiment of tweets,” cs.utexas.edu, 2012.