IJCCS (Indonesian Journal of Computing and Cybernetics Systems)
Vol.13, No.1, January 2019, pp. 43~52
ISSN (print): 1978-1520, ISSN (online): 2460-7258
DOI: https://doi.org/10.22146/ijccs.40125

Received October 26th, 2018; Revised January 23rd, 2019; Accepted January 29th, 2019

Hate Speech Detection for Indonesian Tweets Using Word Embedding and Gated Recurrent Unit

Junanda Patihullah*1, Edi Winarko2
1 Program Studi S2 Ilmu Komputer FMIPA UGM, Yogyakarta, Indonesia
2 Departemen Ilmu Komputer dan Elektronika, FMIPA UGM, Yogyakarta, Indonesia
e-mail: *1jpatihullah@gmail.com, 2ewinarko@ugm.ac.id
Abstrak

Social media has changed the way people express their thoughts and moods. As the activity of social media users increases, criminal acts of spreading hate speech can spread quickly and widely, making manual detection of hate speech infeasible. The Gated Recurrent Unit (GRU) is a deep learning method capable of learning the relationship between information from previous time steps and the current time step. In this research, word2vec is used for feature extraction because of its ability to learn the semantics between words. The performance of GRU is compared with other supervised methods such as Support Vector Machine, Naive Bayes, Random Forest, and Logistic Regression. The results show that the best accuracy, 92.96%, is achieved by GRU with word2vec features. Using word2vec with the comparison methods yields lower accuracy than using TF and TF-IDF features.

Kata kunci—Gated Recurrent Unit, hate speech, word2vec, RNN, Word Embedding
Abstract

Social media has changed the way people express thoughts and moods. As the activity of social media users increases, crimes of spreading hate speech can spread quickly and widely, making manual detection of hate speech infeasible. GRU is one of the deep learning methods that has the ability to learn relations between information from previous time steps and the current time step. In this research, word2vec is used for feature extraction because of its ability to learn the semantics between words. The performance of the Gated Recurrent Unit (GRU) is compared with other supervised methods such as Support Vector Machine, Naive Bayes, Random Forest, and Logistic Regression. The experimental results show that the combination of word2vec and GRU gives the best accuracy, 92.96%. However, using word2vec with the comparison methods results in lower accuracy than using TF and TF-IDF features.

Keywords—Gated Recurrent Unit, hate speech, word2vec, RNN, Word Embedding


1. INTRODUCTION

Social media as a means of communication can disseminate information quickly and widely, making it not only a means of friendship and information sharing, but also a means of trading, dissemination of government policies, political campaigns, and religious preaching [1]. With the increasing activity of social media users comes the possibility of cyber crime such as the dissemination of information containing hate speech. Hate speech on social media can take the form of written words that contain hatred, aimed at individuals or groups to the detriment of the targeted party. Detecting hate speech is very important for analyzing the public sentiment of certain groups towards other groups, so as to prevent and minimize unwanted actions [2].

Detection of hate speech for the Indonesian language has been done before, using bag-of-words features, namely word n-grams and character n-grams. The machine learning algorithms used for classification were Bayesian Logistic Regression, Naive Bayes, Support Vector Machine, and Random Forest Decision Tree. The highest F-measures were achieved when using word n-grams, especially when combined with Random Forest Decision Tree (93.5%), Bayesian Logistic Regression (91.5%), and Naive Bayes (90.2%) [3]. Detection of hate speech in Indonesian has also been done using a backpropagation neural network with a combination of lexicon-based and bag-of-words features, with the highest accuracy obtained at 78.81% [4]. In this paper, we propose the combination of word embedding as our feature and Gated Recurrent Unit (GRU) as our classifier for hate speech detection in Indonesian tweets.
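As background on the classifier, a single GRU time step in the standard formulation of Cho et al. can be sketched in NumPy. This is an illustrative sketch of the GRU equations only, not the implementation used in this paper; the parameter layout is an assumption for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    # One GRU time step (standard Cho et al. formulation).
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate state
    # Interpolate between the previous state and the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde
```

The update gate z decides how much of the previous hidden state to keep, which is what lets the GRU relate information from previous time steps to the current one.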

2. METHODS

In this section, we discuss the architecture and methods used to detect hate speech in Indonesian tweets. This research consists of three main stages: preprocessing, feature extraction, and classification, as shown in Figure 1. Each of these stages is described in the following subsections.

Figure 1. Hate Speech Detection Architecture


2.1 Preprocessing

The preprocessing stage is very important in classification to obtain the best model. Tweet processing consists of several steps: 1) escaping HTML characters; 2) removal of punctuation; 3) splitting attached words; 4) case folding; 5) tokenization; 6) converting slang words; 7) removal of stop-words. Escaping HTML characters removes URL links and HTML entities that are often found in tweets. Removal of punctuation deletes special characters that are often found in tweets, such as hashtags (#), user mentions (@user), and retweet markers (RT), along with other punctuation. Splitting attached words is needed because text generated in social forums is completely informal in nature; many tweets contain multiple attached words like RainyDay or PlayingInTheCold, and these can be split into their normal forms using simple rules and regular expressions. Case folding is the process of converting all characters into lowercase. Tokenization splits text into smaller units. Converting slang words transforms the majority of slang words into standard words. The last step is stop-word removal: stop-words are uninformative words, which are removed based on an existing stoplist dictionary. This research uses the stop-word list from Rahmawan [5].
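The steps above can be sketched as a small Python pipeline. The slang dictionary and stop-word list below are tiny illustrative stand-ins (the paper uses a full slang lexicon and the stop-word list from Rahmawan [5]), and the exact regular expressions are assumptions, not the authors' implementation:

```python
import re

# Hypothetical stand-ins for the paper's slang lexicon and stoplist.
SLANG = {"gak": "tidak", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di"}

def preprocess(tweet):
    # 1) Escape HTML: remove URLs and HTML entities.
    tweet = re.sub(r"http\S+|&\w+;", " ", tweet)
    # 2) Remove Twitter artifacts: hashtags, mentions, retweet markers.
    tweet = re.sub(r"#\w+|@\w+|\bRT\b", " ", tweet)
    # 3) Split attached words (CamelCase), e.g. "RainyDay" -> "Rainy Day".
    tweet = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tweet)
    # 4) Case folding, then strip remaining punctuation.
    tweet = re.sub(r"[^a-z\s]", " ", tweet.lower())
    # 5) Tokenization.
    tokens = tweet.split()
    # 6) Convert slang words to their standard forms.
    tokens = [SLANG.get(t, t) for t in tokens]
    # 7) Stop-word removal.
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("RT @user Hujan di JalanRaya gak berhenti http://t.co/x")` yields `["hujan", "jalan", "raya", "tidak", "berhenti"]` under these stand-in dictionaries.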

2.2 Feature Extraction

Word2vec is the word vector representation defined by Mikolov et al. [8]. The main component for generating vector values in word2vec is an artificial neural network built from the CBOW and Skip-gram architectures. Before word2vec can represent the vector value of each word, it first creates a model of the word distribution during training, using Indonesian documents collected from Wikipedia; the number of documents used is 1,120,973. Building the word2vec feature model involves three processes: vocabulary builder, context builder, and neural network. Figure 2 shows these three processes in the word2vec model building.

Figure 2. word2vec's main architecture
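The first two of these processes, the vocabulary builder and the (skip-gram) context builder, can be sketched in plain Python. This is an illustrative sketch of the idea, not the actual word2vec implementation used in the paper:

```python
def build_vocab(sentences):
    # Vocabulary builder: assign every unique word an integer index.
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

def skipgram_pairs(sentence, window=2):
    # Context builder (skip-gram): pair each target word with the words
    # inside a window around it; the neural network is then trained to
    # predict the context word from the target word.
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs
```

After training, the hidden-layer weights of that network become the word vectors, so words that appear in similar contexts end up with similar vectors, which is why word2vec can learn the semantics between words.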
