Text Classification Using Naïve Bayes
Text Classification Using Naïve Bayes and Support Vector Machine
2019
Natural Language Processing
Student
Contents
Results
    Loading Training data into WEKA
    Matrix Vector
    Classification
        Naïve Bayes classification
        Support Vector Machine Classification
Conclusion
Bibliography
Results
Loading Training data into WEKA
The original datasets are in plain-text format, so we first convert the files into comma-separated values (.csv) using Excel so that they can be loaded into WEKA, after which they are converted into the Attribute-Relation File Format (.arff).
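This conversion can also be scripted rather than done by hand in Excel. Below is a minimal sketch using WEKA's Java API (CSVLoader and ArffSaver); the file name traindata.csv is an assumption based on the traindata.arff used later in this report.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the comma-separated training data (assumed file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("traindata.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out in ARFF format for WEKA.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("traindata.arff"));
        saver.writeBatch();
    }
}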
After opening the WEKA program, we import the training dataset through:
Click Explorer>>>>Open file>>>>select file>>>>Open
Figure 1: Training dataset
From figure 1, there are 1096 instances of the student class, 750 instances of faculty, 336 instances of project, and 620 instances of course. The dataset has nominal entries with no missing values.
Matrix Vector
After importing traindata.arff, we apply a filter to convert the text into a word-vector matrix by clicking Choose>>>>filters>>>>unsupervised>>>>attribute>>>>StringToWordVector, then clicking on StringToWordVector to change the default parameters to the highlighted values shown in figure 2.
Figure 2: Parameters for training dataset filtering
After that, click OK; then click Edit>>>>right-click the attribute to set it as the class>>>>click OK to obtain:
Figure 3: Word document vector
By clicking on any item we then obtain:
Figure 4: Data visualization after converting to word vector
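For reference, the same filtering step can be reproduced outside the Explorer. The following is a minimal sketch using WEKA's Java API; the specific filter settings (lowercasing, word counts, words to keep) are illustrative assumptions, since the actual values are the ones highlighted in figure 2.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BuildWordVector {
    public static void main(String[] args) throws Exception {
        // Load the ARFF training data produced earlier.
        Instances raw = DataSource.read("traindata.arff");
        raw.setClassIndex(raw.numAttributes() - 1); // class attribute assumed last

        // Configure the filter; these values are illustrative assumptions.
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);
        filter.setWordsToKeep(1000);
        filter.setInputFormat(raw);

        // Apply the filter to obtain the word-vector matrix.
        Instances vectorized = Filter.useFilter(raw, filter);
        System.out.println(vectorized.numAttributes() + " attributes after filtering");
    }
}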
Classification:
Naïve Bayes classification
After loading the traindata.arff file and applying the filter, click Classify>>>>Choose>>>>bayes>>>>NaiveBayes.
Modify the parameters as highlighted in figure 5 so that a kernel estimator is used to improve the classification accuracy:
Figure 5: Naïve Bayes parameters for Training dataset
We then use testdata.arff as the test set by clicking Supplied test set>>>>Set>>>>Open file>>>>choose testdata.arff>>>>Open, and then Start to run the model.
Parameters
Naïve Bayes was run with a kernel estimator but without supervised discretization.
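The same experiment can be scripted as a sanity check on the Explorer run. Below is a minimal sketch using WEKA's Java API, assuming the class attribute is last in each ARFF file and using batch filtering so that the training and test sets end up with identical word attributes.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NaiveBayesExperiment {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("traindata.arff");
        Instances test = DataSource.read("testdata.arff");
        train.setClassIndex(train.numAttributes() - 1); // class assumed last
        test.setClassIndex(test.numAttributes() - 1);

        // Batch filtering: initialise on the training set, apply to both,
        // so train and test share the same word attributes.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter);

        // Kernel estimator on, supervised discretization off (figure 5).
        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);
        nb.setUseSupervisedDiscretization(false);
        nb.buildClassifier(trainVec);

        // Evaluate on the supplied test set and print the results.
        Evaluation eval = new Evaluation(trainVec);
        eval.evaluateModel(nb, testVec);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}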
Output
Figure 6: Naive Bayes on Training set
Figure 7: Naive Bayes training model on test set
Using the Naïve Bayes classifier, we find that only 38.9247% of the instances in the test dataset are correctly classified, while 61.0753% are incorrectly classified (figure 7). However, when the model is applied to the training set alone, the accuracy is much higher: 87.3118% correctly classified and 12.6882% incorrectly classified (figure 6). In the confusion matrix for the test data, 374 faculty instances were misclassified as student, 160 project instances were classified as student, and 310 course instances were classified as student. On the training set, by contrast, 4 course instances were misclassified as student and 2 as faculty, while 3 faculty instances were misclassified as student.
Support Vector Machine Classification
Since WEKA has no classifier named SVM as such, we use Sequential Minimal Optimization (SMO), WEKA's algorithm for training a support vector classifier:
Click Choose>>>>functions>>>>SMO>>>>Close.
We then use testdata.arff as the test set by clicking Supplied test set>>>>Set>>>>Open file>>>>choose testdata.arff>>>>Open, and then Start to run the model.
Parameters
We modify the parameters, as highlighted in figure 9, by clicking on the SMO function (figure 8).
Figure 8: SMO function
Figure 9: Parameters adjustment for SMO
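As with Naïve Bayes, the SMO run can be reproduced in code. The following is a minimal sketch using WEKA's Java API; the complexity constant C and the kernel settings shown are assumed defaults, since the actual values are the ones highlighted in figure 9.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SmoExperiment {
    public static void main(String[] args) throws Exception {
        // Load and batch-filter the data exactly as in the Naïve Bayes sketch.
        Instances train = DataSource.read("traindata.arff");
        Instances test = DataSource.read("testdata.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter);

        // SMO with a polynomial kernel; C and the exponent are assumed
        // defaults, since the actual values are shown in figure 9.
        SMO smo = new SMO();
        smo.setC(1.0);
        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(1.0);
        smo.setKernel(kernel);
        smo.buildClassifier(trainVec);

        // Evaluate on the supplied test set, as in the Explorer run.
        Evaluation eval = new Evaluation(trainVec);
        eval.evaluateModel(smo, testVec);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}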
Output
When the SVM model is applied to the training data alone, it correctly classifies 99.2115% of the instances, with only 0.7885% (11 instances) misclassified. Under the detailed accuracy by class, 4 student instances were misclassified as course, 5 faculty instances as course, and 2 project instances as course (figure 10).
Figure 10: Support Vector Machine on the training dataset
Figure 11: SVM on test data using training set
Using the SMO classifier, we find that only 38.9247% of the instances in the test dataset are correctly classified, while 61.0753% are incorrectly classified. From the confusion matrix, 374 faculty instances were misclassified as student, 160 project instances were classified as student, and 310 course instances were classified as student when the trained model was applied to the test data (figure 11).
Figure 12: SMO model on test data using training set
Conclusion
The Naïve Bayes and Support Vector Machine classification algorithms produce different results. This may be due to their sensitivity to parameter optimization: given the different optimization techniques and kernel functions applied, the difference can be explained.
In this study, the SVM and Naïve Bayes classifiers predicted the test data from the training set with the same accuracy, 38.9247%. However, the SVM performed better than Naïve Bayes when classifying the training set, with an accuracy of approximately 99% compared to approximately 87% for Naïve Bayes. Therefore, SVM would be the best algorithm for text classification given the datasets presented.
Bibliography
Deng, Z. & Zhang, M. (2015). Improving Text Categorization Using the Importance of Words in Different Categories. In: Hao, Y. et al. (eds), Computational Intelligence and Security. Heidelberg: Springer.
Gupta, S. (2018). Text Classification: Application and Use. Retrieved from: https://blog.paralleldots.com/product/text-analytics/text-classification-applications-use-cases/
Joachims, T. (2011). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Heidelberg: Springer.
MonkeyLearn. (2018). A Comprehensive Guide to Classifying Text with Machine Learning. Retrieved from: https://monkeylearn.com/text-classification/
Preslav, N., Ariel, S., Brian, W., & Marti, H. (2015). Supporting annotation layers for natural language processing. ACL Poster/Demo Track, 3(7), pp. 72-88.
Raschka, S. (2014). Naïve Bayes. Retrieved from: https://sebastianraschka.com/Articles/2014_naive_bayes_1.html
Thangara, M. & Sivakami, M. (2018). Text Classification Techniques: A Literature Review. Information, Knowledge and Management, 13(1), pp. 118-130. DOI: 10.28945/4066
Verspoor, M. & Cohen, B. (2013). Natural Language Processing. Computational and Information Science, 2(1), pp. 1495-1498.