AMAZON REVIEWS: Text Classification Using Machine Learning Models

Added on 2022/07/11

AI Summary

This project focuses on text classification of Amazon reviews using machine learning techniques. The study preprocesses review data, addressing issues like special characters, stop words, and lemmatization, before applying Multinomial Naive Bayes and Logistic Regression models for categorization. The project explores feature extraction methods like TF-IDF and n-grams to enhance classification accuracy. The research evaluates model performance based on accuracy, precision, and other metrics. The results show that while both models can classify review data, the accuracy varies, highlighting the impact of data imbalance and feature selection. The project also discusses the ethical considerations and limitations of the models, suggesting future research directions, including the application of advanced techniques like Word2Vec and exploring alternative classifiers like XGBoost and CNNs to improve the accuracy of text classification. The project also compares and contrasts the performance of different classifiers, including Logistic Regression, Naive Bayes, and XGBoost, to identify the best approach for this specific classification task.

AMAZON REVIEWS
BAVANA
Applied Machine Learning
School of computing and mathematical sciences
Abstract: A large number of product reviews are now available online. Besides being a vital source of knowledge,
this user-generated content significantly influences customers' buying decisions, in which reviews can play a
moderate, significant, or critical role. Consumers want to find useful information as quickly as possible, but
searching and comparing text reviews is tedious when they are flooded with information, and many consumers have
become unsure how far to rely on online reviews. The sheer volume and unstructured nature of text reviews make it
difficult for a user to select a product, whereas the star rating, which ranges from 1 to 5 on Amazon, provides a
fast overview. This task is about text categorization: using machine learning models to classify the review
material into a specific category. Two machine learning models are used, a multinomial naive Bayes classifier and
a support vector machine classifier. Pre-processing techniques are applied to the peer review data before the
models are trained. Both models are able to classify the review data, and the experiments also highlight the
factors that influence text classification performance.
Keywords: Data cleaning, data processing, Feature extraction, Model training
I. Introduction
Classification is a supervised machine learning task that seeks to create a model that can generate
predictions. This model is built using annotated historical data.
Text categorization, often known as text classification, is a subset of classification that deals with text rather
than structured data. As formulated in [1], given a set of documents and a predetermined set of classes C, the goal
is to use a classification algorithm to learn a classifier from human-labeled training documents and, for each
document d, predict the most probable class c. Instead of manually assigning a class to each document, which might
take a long time, text categorization chooses automatically from a list of established categories. Text
classification is thus the practice of sorting texts into groups. Natural language processing (NLP) is used in text
categorization to analyze text and assign a set of predetermined tags or categories based on its context. NLP is
also used for sentiment analysis, topic recognition, and language translation. The three primary techniques for
text classification are rule-based systems, automated systems employing machine learning methods, and hybrid
systems.
II. Ethical discussion
The dataset includes the review text, the review score, and the product category with ratings. The
dataset contains no information about the authors of the papers, so there are no concerns about their
privacy. This research used the decision tree and random forest machine learning models. Machine learning is
used to categorize the text, and the classification may be inaccurate when words in the review text are
misinterpreted. The dataset itself may also be noisy and unreliable, in which case both the dataset and the
trained machine learning models will be biased. Now that the data has been obtained, it's time
to put it to use. In the following sections, we'll go over the steps necessary to predict sentiment based on
reviews of various films. Any text categorization task can benefit from these procedures. To train the text
classifiers, we'll use Python's Scikit-Learn machine learning framework.
III. Dataset Preparation
The pandas read_csv function is used to read the dataset for this task. Exploratory data analysis (EDA) methods
are then used to examine its properties. The size of the dataset is given by its number of rows and columns: the
dataset contains 31887 rows and 5 columns. The info method can then be used to inspect the data type of each
column. The isnull method can be used to find null values in a dataset, and the given dataset has them. The unique
values in each column can be seen using the unique method, which shows that the confidence score and acceptance
status columns contain NaN values. The dropna() method is used to eliminate rows with null and NaN values. Because
removing rows leaves gaps in the index, the reset_index method is then used to renumber it. The value_counts
function displays the number of rows under each unique label.
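The inspection and cleaning steps described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the inline CSV and the column names (review_text, review_score, acceptance_status) are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the review CSV; the column names are assumptions.
csv_text = """review_text,review_score,acceptance_status
Great product works well,5,accept
,3,accept
Stopped working after a week,1,reject
Average quality,3,
Not worth the price,2,reject
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # (rows, columns)
df.info()                 # data type and non-null count per column
print(df.isnull().sum())  # number of missing values per column

# Drop every row that has a missing value, then rebuild a contiguous index.
clean = df.dropna().reset_index(drop=True)

# Number of rows under each unique label.
label_counts = clean["acceptance_status"].value_counts()
print(label_counts)
```

Passing drop=True to reset_index discards the old index instead of keeping it as an extra column.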
The countplot function of the seaborn library is used to plot the distribution of data by category for acceptance
status. The next step after importing the dataset is to pre-process the text. Text may contain numbers, special
characters, and unwanted spaces. Depending on the problem, we may or may not need to remove them. Here, we delete
all special characters, numerals, and unwanted spaces from the text for the sake of clarity. Stop words are
removed. These are common words such as a, able, either, else, ever, and so on that add nothing to the
classification. For our purposes, "the election was over" becomes "election over", and "a very close game" becomes
"very close game". Words are lemmatized. This is where different inflections of the same word are grouped
together, so that election, elections, elected, and similar terms are counted as more instances of the same word.
n-grams are used. Instead of counting single words, we may count sequences of words, such as "clean match" and
"close election". TF-IDF is used. Instead of just counting frequency, we may go a step further and penalize terms
that appear in the majority of the documents.
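A minimal sketch of these pre-processing steps is given below, under stated assumptions. The small STOP_WORDS set and the clean_text helper are hypothetical and only illustrate the idea; a real project would use a full stop-word list. Lemmatization is left out because it needs an additional library such as NLTK. The n-gram and TF-IDF steps use scikit-learn's TfidfVectorizer.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "was", "is", "it", "to", "and", "of"}


def clean_text(text):
    """Lowercase, keep letters only, collapse spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits and special characters
    tokens = text.split()                  # split() also collapses repeated spaces
    return " ".join(t for t in tokens if t not in STOP_WORDS)


reviews = [
    "The election was over!!",
    "A very close game...",
    "It was a clean match, 2 goals to 1.",
]
cleaned = [clean_text(r) for r in reviews]
print(cleaned)

# Count unigrams and bigrams, and weight them by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(cleaned)
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

With ngram_range=(1, 2), the vocabulary contains single words as well as two-word sequences such as "clean match".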
IV. Methods
Logistic regression and the multinomial naive Bayes algorithm are the two machine learning models used
in this work. The multinomial naive Bayes algorithm is particularly well suited to classification problems
with discrete features. Text classification, for example, can benefit from this because text features such as word
counts are discrete [6]. The multinomial distribution normally requires integer counts, but in practice fractional
counts also work. Count vectorization and TF-IDF are two feature representations that work with this technique to
enable efficient classification [7]. This algorithm is the most accurate for the data.
1. Multinomial naïve bayes:
The Gaussian assumption is by no means the only simple assumption that can be used to specify the generative
distribution for each label. Multinomial naive Bayes is another useful example, in which the features are assumed
to be generated from a simple multinomial distribution. Because the multinomial distribution describes the
probability of observing counts across multiple categories, multinomial naive Bayes is best suited to features
that represent counts or count rates. The concept is the same as for Gaussian naive Bayes, except that instead of
modeling the data distribution with the best-fit Gaussian, we model it with the best-fit multinomial distribution.
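As an illustration of multinomial naive Bayes on word counts, the sketch below chains scikit-learn's CountVectorizer and MultinomialNB in a pipeline. The toy texts and the positive/negative labels are assumptions made for the example and are not taken from the project's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
train_texts = [
    "great product works perfectly",
    "excellent quality great value",
    "love it works great",
    "terrible product broke quickly",
    "poor quality waste of money",
    "awful broke after one day",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# CountVectorizer turns each text into word counts;
# MultinomialNB fits one multinomial distribution over words per class.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(train_texts, train_labels)

predictions = nb_model.predict(["great quality works", "terrible waste broke"])
print(predictions)
```

MultinomialNB applies Laplace smoothing by default (alpha=1.0), so a word that never occurred with a class does not force that class's probability to zero.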
2. Logistic regression
The second classification approach is multinomial logistic regression, also known as MaxEnt. Logistic regression
belongs to a family of classifiers known as exponential or log-linear classifiers. Like naive Bayes, a log-linear
classifier works by extracting a set of weighted features from the input, taking logs, and combining them linearly
(that is, each feature is multiplied by a weight and then added up). Although we sometimes use the shorthand
logistic regression even when we are talking about multiple classes, logistic regression technically refers to a
classifier that classifies an observation into one of two classes, and multinomial logistic regression is used
when classifying into more than two classes. The most significant distinction between naive Bayes and logistic
regression is that the latter is a discriminative classifier, whilst the former is a generative classifier.
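A minimal sketch of logistic regression with more than two classes is shown below. The toy texts and the three class names are assumptions made for the example. As a discriminative model, logistic regression estimates the class probabilities for a document directly, which predict_proba exposes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical three-class toy data (the class names are assumptions).
texts = [
    "terrible broke quickly",
    "awful waste of money",
    "average quality acceptable price",
    "okay nothing special",
    "excellent works perfectly",
    "great value love it",
]
labels = ["low", "low", "medium", "medium", "high", "high"]

# TF-IDF features feed a logistic regression classifier.
lr_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
lr_model.fit(texts, labels)

classes = lr_model.named_steps["logisticregression"].classes_
probs = lr_model.predict_proba(["excellent value works"])

# One probability per class; each row sums to 1.
for name, p in zip(classes, probs[0]):
    print(name, round(p, 3))
```

The order of the probability columns follows classes_, which scikit-learn sorts alphabetically.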
V. Discussion
The primary purpose is to categorize the review content into one of two groups: accept or reject. The
data was used to train the machine learning models, which were then used to classify the text. The results cover
the classification of both the acceptance status and the review score. For acceptance status, both ML models have
an accuracy score of roughly 67 percent. However, for both machine learning models, the accuracy score for
classifying the review score is only approximately 28%. The dataset imbalance affects the training of the ML
models, which in turn affects their classification accuracy.
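One way to see how class imbalance interacts with accuracy is to compare against a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's DummyClassifier on a made-up 80/20 label split. The split is an assumption for illustration only and does not reflect the project's data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: 80 "accept" and 20 "reject".
y = ["accept"] * 80 + ["reject"] * 20
X = [[0]] * len(y)  # the dummy model ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y, pred)  # mean of the per-class recalls
print(acc, bal_acc)
```

A model is only informative if it beats this baseline. Balanced accuracy, or the class_weight="balanced" option of LogisticRegression, are possible ways to account for the imbalance.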
VI. Experiments and evaluation
We noticed that naive Bayes required much higher processing complexity than logistic regression and
XGBoost throughout the training period. Logistic regression is a machine learning algorithm that, like linear
regression, models the relationship between dependent and independent variables, except that it is used for
classification. Naive Bayes assumes that the inputs are independent of one another and of equal importance: each
feature of the input text is treated as having no bearing on the others. This assumption almost never holds in
practice, yet it is routinely made in multiclass text classification. XGBoost is an ensemble learning algorithm
based on decision trees.
The algorithm's main feature is its scalability, which allows for rapid learning via parallel and distributed
computing while still ensuring efficient memory usage. Models are compared based on how well they perform on the
test set. Increasing the number of terms in the dictionary increased the time needed for training and evaluation
but had no major effect on accuracy. One explanation is that the number of data points and the dimension of the
feature space aren't that dissimilar. As a result, the curse of dimensionality could be at work here.
For naive Bayes, the tf-idf method performs worse than the traditional word count method. The test accuracy,
however, is only 65%, indicating that this isn't the case.
To train, test, and develop a model, four classifiers were employed to predict category and review score
using 30,197 rows of preprocessed text. The data is divided into two sets, a train set and a test set (18,296 and
12,398 rows). The four classifiers are LR, NB, XGBoost, and LR (tf-idf), each built as a pipeline that extracts
features and then fits the model. After that, the trained models were evaluated using 1. accuracy, 2. precision,
3. confusion matrix, 4. F1 score, and 5. recall.
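The five evaluation measures listed above are all available in sklearn.metrics. The sketch below computes them on a small set of hypothetical true and predicted labels. The labels are made up for the example, and "accept" is treated as the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels for eight test documents.
y_true = ["accept", "accept", "accept", "accept", "accept",
          "reject", "reject", "reject"]
y_pred = ["accept", "accept", "accept", "accept", "reject",
          "accept", "reject", "reject"]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label="accept")
recall = recall_score(y_true, y_pred, pos_label="accept")
f1 = f1_score(y_true, y_pred, pos_label="accept")

# Rows are true classes, columns are predicted classes, in the given label order.
cm = confusion_matrix(y_true, y_pred, labels=["accept", "reject"])

print(accuracy, precision, recall, f1)
print(cm)
```

For multiclass targets such as the review score, precision, recall, and F1 need an averaging strategy, for example average="macro".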
VII. Future work
Feature reduction and feature selection are clearly important for the end goal of sentiment
classification. Aside from the strategies discussed in this paper, there are a variety of other options that can be
studied to narrow down the features even further. Part-of-speech (POS) tagging can be used to tag words in the
training data and focus on the important words in light of the tags. Order modifiers are the most fundamental
tags. Other tags must also be considered, as some of them may have predictive value.
Other advanced techniques, such as those that use Word2Vec, can also be used. Word2Vec would find similar words
and, in essence, establish relationships between terms. Several classifiers could use Word2Vec features to improve
modeling.
VIII. Conclusion
Four different models were used to predict whether a product review on Amazon was positive or negative. With
a validation accuracy of 0.9107, the Convolutional Neural Network outperformed the other three models. With a
validation accuracy of 0.8023, SVM was the worst performing model. This can be improved by increasing the max
features hyperparameter of the tf-idf output that is supplied to the SVM model. This will, however, make model
training far more computationally intensive. As the sample examples show, our models perform well on previously
unseen examples.
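The effect of the max features setting can be sketched with scikit-learn's TfidfVectorizer, whose max_features parameter caps the vocabulary size. The toy texts, the labels, and the choice of LinearSVC as the SVM are assumptions made for this example, not the configuration used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data with 20 distinct words.
texts = [
    "great product works perfectly",
    "excellent quality great value",
    "love it works great",
    "terrible product broke quickly",
    "poor quality waste of money",
    "awful broke after one day",
]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

# max_features keeps only the most frequent terms; None keeps them all.
feature_counts = []
for max_features in (5, None):
    vectorizer = TfidfVectorizer(max_features=max_features)
    n_features = vectorizer.fit_transform(texts).shape[1]
    feature_counts.append((max_features, n_features))
print(feature_counts)

# A linear SVM trained on the capped tf-idf features.
svm_model = make_pipeline(TfidfVectorizer(max_features=10), LinearSVC())
svm_model.fit(texts, labels)
svm_prediction = svm_model.predict(["great value"])[0]
print(svm_prediction)
```

A larger cap gives the SVM more terms to work with but also widens the feature matrix, which is where the extra training cost comes from.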
"Text Classification with Python and Scikit-Learn." 17 Feb. 2019,
https://stackabuse.com/text-classification-with-python-and-scikit-learn/.
"Classification of text documents using ... - scikit-learn."
https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html.
"Multi-Class Text Classification with Scikit-Learn | by ...." 19 Feb. 2018,
https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f.
"MultiClass Text Classification with Scikit-Learn - Jean Snyman."
https://www.jeansnyman.com/posts/multi-class-text-classification-with-scikit-learn/.
"Working With Text Data — scikit-learn 1.0.2 documentation."
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.
"Text classification with SVM using python and Scikit Learn ...." 02 May. 2018,
https://ranithsachin.wordpress.com/2018/05/02/text-classification-with-svm-using-python-and-scikit-learn/.
"Text Classification with NLTK and Scikit-Learn | Libelli." 19 May. 2016,
https://bbengfort.github.io/2016/05/text-classification-nltk-sckit-learn/.
"Data Preprocessing: Concepts. Introduction to the concepts ...." 25 Nov. 2019,
https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825.
"Data Preprocessing: Python, Machine Learning, Examples and ...." 16 Mar. 2022,
https://blog.quantinsti.com/data-preprocessing/.