
Film Review Sentiment Analysis: Logistic Regression vs Support Vector Classification

   

Journal of Applied Intelligent System (e-ISSN : 2502-9401 | p-ISSN : 2503-0493)
Vol. 8 No. 3, November 2023, pp. 341-352
DOI:


Film Review Sentiment Analysis: Comparison of Logistic
Regression and Support Vector Classification Performance
Based on TF-IDF

Dadan Saepul Ramdan*1, Riri Damayanti Apnena2

1,2Politeknik TEDC Bandung, Jl. Politeknik - Pasantren KM. 2 Lantai 1, Cibabat, Cimahi Utara,
Kota Cimahi, Jawa Barat 40513

Email :
dsramdan@poltektedc.ac.id*1, riri.damayanti.apnena@poltektedc.ac.id2
*Corresponding author

Castaka Agus Sugianto3

3Politeknik TEDC Bandung, Jl. Politeknik - Pasantren KM. 2 Lantai 1, Cibabat, Cimahi Utara,
Kota Cimahi, Jawa Barat 40513

Email : castaka@poltektedc.ac.id3

Received 12 August 2023;

Abstract
Film sentiment analysis is the process of evaluating the sentiment expressed in film reviews so
that positive and negative responses to a film can be identified. In this study, sentiment
analysis is carried out on film reviews from IMDb to determine which reviews from film critics
are positive and which are negative. The reviews are processed with TF-IDF, and positive or
negative labels are then predicted from the processed reviews using logistic regression and
support vector classification. The data consist of 2000 IMDb film reviews, 1000 positive and 1000
negative. The data are first preprocessed and then split into 70% training data and 30% testing
data. The logistic regression algorithm obtains a test accuracy of 80.61%, while the support
vector classification algorithm obtains a test accuracy of 82.42%.

Keywords: Sentiment Analysis, TF-IDF, Logistic Regression, Support Vector Classification, Film

1. INTRODUCTION
Film is an art form that combines video, sound and narration to convey something to the
audience. Because of this, films are a common means of entertainment for everyone [1].
Fithratullah [2] argued that film is a work of art made in response to the needs and desires that
emerge from society. The popularity of a film therefore usually depends on the kind of reviews
given by the audience [3]. With current technological advances, it is common for people who watch
a film to express their opinions on public social networking sites [4]. Social media is thus a
source from which opinions can be obtained instantly [5], along with people's statements on
currently trending topics [6]. In film sentiment analysis, the opinions given by the audience can
be used to find out how the audience feels about or responds to a particular film. The responses
are divided into two classes, namely positive and negative. Teixeira et al. [7] argued that film
is a process of character formation for events that can occur in a predetermined time and space.
Because of this, preferences and opinions can vary depending on each person's point of view.

Sentiment analysis is the processing of a text in order to find out the value or message
contained in it [8]. Purnomoputra et al. [9] stated that sentiment analysis can be used to
classify films as good or bad. It does so by computationally determining the expressions or
feelings in the reviews given by the audience [10]. Dang et al. [11] stated that the data for
sentiment analysis are usually obtained from social media, where the audience provides a great
deal of information and reviews. Sentiment analysis is part of the data mining process related to
natural language processing (NLP) [12], and it has been a research topic since the early 2000s
[13]. In sentiment analysis, the reviews show how users respond or react to a service or product
[14]. The results can therefore assist in the decision-making process [15] about whether a
service is good or bad. In the world of film, the quality of a film is judged from existing
audience reviews [16], so sentiment analysis serves to determine whether a given review is
positive or negative. The process of sentiment analysis is therefore a data classification method
[17] and plays a major role in analysing the audience's perspective on something [18].

Machine learning (ML) is a branch of artificial intelligence (AI) [19]. Machine learning works
with large amounts of data to train and optimize models based on algorithms, and these models can
later make predictions [20]. In machine learning, the model does not need to be explicitly
programmed for the task; it learns automatically from data [21]. TR et al. [22] argued that the
machine learning method works by having the model learn patterns from existing data during
training, after which test samples are input. The method is considered beneficial because the
model can gradually learn from its mistakes and improve its performance by learning from more
similar data [23]. In this study, the sentiment analysis process uses the TF-IDF algorithm to
convert text into vector form and makes predictions using the logistic regression and support
vector classification algorithms, so that knowledge can be easily extracted from large amounts of
data [24]. Machine learning methods are also useful in many sectors of the economy, such as
manufacturing and banking [25], not only within a particular scope.

Term Frequency/Inverse Document Frequency (TF-IDF) is a method used in text mining. TF-IDF is
usually used to weight words based on their uniqueness, so that the relevance between words,
documents and certain categories can be found. Zhou et al. [26] argued that TF-IDF is a
statistical measure that is widely used for processing data in text form. TF-IDF is an effective
method for extracting knowledge from an attribute so that the attribute represents the whole
document properly [27]. In the process, statistical calculations are used to transform text into
vectors, and the similarity between the data and the text vectors is then calculated [28].

Logistic regression (LR) is a popular and commonly used algorithm for classifying data [29]. It
is widely used for binary classification, that is, classification with only two target classes
[30]. Pan et al. [31] argued that logistic regression is a linear classification method that is
easy and simple to use. Logistic regression is a supervised learning method [32]. In the process,
logistic regression is used to measure the statistical significance of each predictor variable
with a probability approach [33].

Support Vector Classification (SVC) is the classification form of the Support Vector Machine
(SVM), which follows the structural risk minimization principle [34]. Support Vector
Classification works in the same way as the Support Vector Machine, namely by maximizing the
margin, the distance between the decision boundary and the samples closest to it (the support
vectors) [35]. In the process, a hyperplane is sought that separates the samples of the existing
classes [36]. Djedidi et al. [37] stated that in this method a hyperplane is sought that divides
the positive and negative classes with the most optimal margin.
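
To make the idea of a margin concrete, the following is a minimal sketch that compares two candidate hyperplanes on a toy two-dimensional data set. The points, labels and hyperplane coefficients are invented for illustration and are not taken from the study; the function names are likewise hypothetical. The sketch only measures the margin of a given hyperplane; it does not perform the optimization that an SVM solver carries out.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    return (sum(w_i * x_i for w_i, x_i in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-signed distance over all points.

    The value is positive only if every point lies on the side of the
    hyperplane that matches its label (+1 or -1).
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Toy data (illustrative only): two positive and two negative samples.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate separating hyperplanes: x + y - 2 = 0 and x - 1 = 0.
margin_a = margin((1.0, 1.0), -2.0, points, labels)
margin_b = margin((1.0, 0.0), -1.0, points, labels)
print(margin_a, margin_b)
```

Both hyperplanes separate the toy classes, but the first leaves a larger gap to its nearest samples, so a maximum-margin classifier would prefer it.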

Soubraylu et al. [38] discussed sentiment analysis of film reviews using a hybrid convolutional
bidirectional recurrent neural network. The study aimed to carry out sentiment analysis and build
models using a hybrid deep learning method that combines a convolutional neural network (CNN)
with a bidirectional gated recurrent unit (BGRU). The resulting model obtained better results
than other models, with F1-scores of 87.62% and 77.4% on the IMDb and Polarity datasets. Bodapati
et al. [39] discussed sentiment analysis of film reviews using Long Short-Term Memory (LSTM)
networks. The study aimed to build an LSTM model for sentiment analysis, and the model obtained
better accuracy than other methods, namely 88.46%.

In this article we have investigated the review analysis process for films using TF-IDF with
Logistic Regression and Support Vector Classification. In several previous studies, only one
algorithm was used, for example only Logistic Regression or only Support Vector Classification.
The accuracy of each algorithm can still be improved by adding parameters, as we have done and as
explained in the following subsections. It is known that Logistic Regression and Support Vector
Classification can

2. RESEARCH METHOD

2.1. Dataset
Film is an art form that combines video, sound and narration to convey something to the
audience. Teixeira et al. [7] argued that film is a process of character formation for events
that can occur in a predetermined time and space. In this study, a sentiment analysis of film
reviews is carried out. Sentiment analysis is part of the data mining process. It can be used to
classify films as good or bad [9], and it is therefore related to natural language processing
(NLP) [11]. The results of the sentiment analysis can help in deciding [15] whether or not to
watch a film, because the quality of a film is judged from existing audience reviews [16].

This research uses Pang and Lee's Movie Review Data dataset, obtained from
http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/. The dataset contains 2000 records with 2
columns. The first column is used as the target variable and the second column as the predictor
variable. The 2000 records are divided into 2 classes, positive and negative, with 1000 records
in each class. The predictor column is of type string and contains the movie reviews. The purpose
of using this dataset is to build a model that can carry out sentiment analysis and thus
distinguish between positive and negative reviews in the data.

The 2000 records are further divided into training data and testing data, with 70% used for
training and 30% for testing. Both the training data and the testing data contain the 2 target
classes, positive and negative. The purpose of splitting the data is to develop and test models
with the chosen algorithms. The training data are used to train the model. In this study, the
algorithms used are a logistic regression classifier and a support vector classifier. After
training, the model is tested on the testing data so that the performance of the trained
sentiment analysis model can be measured.
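
The 70/30 division can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the paper does not say whether the split was stratified or which random seed was used, so the per-class shuffling, the seed and the function name `split_by_class` are all hypothetical. The labels are placeholders standing in for the 1000 positive and 1000 negative reviews, and the sketch assumes no reviews are dropped during preprocessing.

```python
import random

def split_by_class(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately.

    Splitting per class keeps the positive/negative ratio the same in both
    parts (an assumption; the paper does not state how the split was drawn).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Placeholder labels: 1000 positive and 1000 negative reviews.
labels = ["pos"] * 1000 + ["neg"] * 1000
train_idx, test_idx = split_by_class(labels)
print(len(train_idx), len(test_idx))  # 1400 600
```

The returned index lists can then be used to select the matching reviews and labels for training and for testing.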

2.2. Term Frequency/Inverse Document Frequency (TF-IDF)
Term Frequency/Inverse Document Frequency, abbreviated as TF-IDF, is a method commonly used in
natural language processing (NLP). Term Frequency (TF) is the number of times a word appears in a
text relative to the total number of words in that text; the calculation is given in (1). Inverse
Document Frequency (IDF) measures how unique a word is across the corpus, that is, the collection
of texts or reviews. It is based on the document frequency (DF), the number of texts in the
corpus that contain the word, with N denoting the total number of texts in the corpus; the
calculation is given in (2). The TF-IDF score is obtained by multiplying the TF value by the IDF
value, as given in (3).

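\[ TF(Word, Document) = \frac{\text{number of times the word appears in the document}}{\text{number of words in the document}} \tag{1} \]

\[ IDF(Word, Corpus) = \log\left(\frac{N}{1 + DF(Word, Corpus)}\right) \tag{2} \]

\[ TF\text{-}IDF(Word, Document, Corpus) = TF(Word, Document) \times IDF(Word, Corpus) \tag{3} \]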
In this study, the TF-IDF process also converts the text data into vectors, so the resulting text
data take the form of a TF-IDF matrix. This matrix is later used for training and testing the
model.
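
The calculations in (1) to (3) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the logarithm is taken as the natural logarithm (the paper does not state the base), texts are tokenized by whitespace only, and the four short reviews are invented examples rather than records from the dataset. Library implementations such as scikit-learn's `TfidfVectorizer` use a differently smoothed IDF and normalize the vectors, so their values will not match this sketch exactly.

```python
import math
from collections import Counter

def tf(word, document):
    """Equation (1): occurrences of `word` divided by the document length."""
    return Counter(document)[word] / len(document)

def idf(word, corpus):
    """Equation (2): log(N / (1 + DF)), DF = number of documents with `word`."""
    df = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / (1 + df))

def tf_idf(word, document, corpus):
    """Equation (3): product of the TF and IDF values."""
    return tf(word, document) * idf(word, corpus)

def tf_idf_matrix(corpus):
    """One row per document, one column per vocabulary word."""
    vocabulary = sorted({word for document in corpus for word in document})
    matrix = [[tf_idf(word, document, corpus) for word in vocabulary]
              for document in corpus]
    return vocabulary, matrix

# Invented example reviews, tokenized by whitespace (assumption).
corpus = [
    "a great film with a great cast".split(),
    "a dull film".split(),
    "the plot was dull and slow".split(),
    "the cast was fine".split(),
]
vocabulary, matrix = tf_idf_matrix(corpus)
print(len(matrix), len(vocabulary))
print(tf_idf("great", corpus[0], corpus))
```

With the form of (2) used here, a word that occurs in all but one of the N texts receives an IDF of zero, and a word that occurs in every text receives a slightly negative IDF, which is one reason library implementations add further smoothing.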

2.3. Logistic Regression (LR)
Logistic regression is widely used for binary classification, that is, classification with only
2 target classes [30]. Pan et al. [31] argued that logistic regression is a linear classification
method that is easy and simple to use. Logistic regression is a supervised learning method [32].
It applies a linear transformation to the features and maps the result to a probability value
using the logistic function, also called the sigmoid function. The output is a probability
between 0 and 1, which is then assigned to class 0 or class 1. The mathematical formula for
logistic regression is given in (4).

\[ P(C = 1 \mid Z) = \frac{1}{1 + e^{-(q_0 + q_1 Z_1 + q_2 Z_2 + \dots + q_n Z_n)}} \tag{4} \]

Where:

$P(C = 1 \mid Z)$ is the probability of class 1 given input $Z$

$e$ is Euler's number, approximately 2.71828

$q_0, q_1, q_2, \dots, q_n$ are the model parameters estimated during training

$Z_1, Z_2, \dots, Z_n$ are the predictor variables or features
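
As a worked illustration of the logistic regression formula, the sketch below evaluates the standard sigmoid, 1 / (1 + e^(-x)), on a linear combination of features and then applies a decision threshold. The parameter values, feature values and the 0.5 threshold are made up for the example and are not results from the study; the function names are hypothetical.

```python
import math

def predict_probability(q, z):
    """P(C = 1 | Z) for parameters q = [q0, q1, ..., qn] and features z = [Z1, ..., Zn]."""
    linear = q[0] + sum(q_i * z_i for q_i, z_i in zip(q[1:], z))
    return 1.0 / (1.0 + math.exp(-linear))

def predict_class(q, z, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_probability(q, z) >= threshold else 0

# Made-up parameters and TF-IDF feature values (illustrative only).
q = [-1.0, 2.0, -3.0]
z = [0.8, 0.1]

probability = predict_probability(q, z)
print(round(probability, 4), predict_class(q, z))
```

Here the linear part is -1.0 + 2.0 * 0.8 - 3.0 * 0.1 = 0.3, so the probability is slightly above 0.5 and the sample is assigned to class 1.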
2.4. Support Vector Classification (SVC)
