Plagiarism Detection: Kohonen Maps and SVD Techniques Report


Added on  2023/01/13

AI Summary
This report presents a novel approach to plagiarism detection utilizing Kohonen Maps and Singular Value Decomposition (SVD). The research addresses the growing concern of plagiarism in the digital age by leveraging intrinsic plagiarism detection (IPD) techniques, specifically focusing on stylometry and authorship verification. The methodology involves analyzing writing styles through the application of Kohonen Maps, a type of self-organizing map, for clustering and visualizing similarities between documents. SVD is employed to reduce the dimensionality of the data, facilitating faster and more efficient comparison of suspicious documents with original sources. The study explores various style markers, including character-level statistics, part-of-speech analysis, and sentence-level characteristics, to identify instances of plagiarism. The report also reviews existing plagiarism detection methods, including supervised and unsupervised learning approaches, highlighting the advantages of the proposed Kohonen Maps and SVD-based technique. The conclusion emphasizes the effectiveness of this approach in identifying copied content and verifying authorship, contributing to the ongoing development of more robust plagiarism detection tools.
Plagiarism detection
Using Kohonen Maps and Singular Value Decomposition
for Plagiarism Detection
Abstract
In the digital age, the amount of material available online is growing enormously, and countering plagiarism becomes more challenging by the day. With research papers, analysis reports, books and other documents so easily available, plagiarism is taking an increasing toll. This paper introduces an approach to plagiarism detection that uses Kohonen Maps and singular value decomposition. Researchers have focused on detecting plagiarism by comparing the suspicious document with original documents. These techniques are widely used, but more effective intrinsic techniques are also required, such as comparing the writing style of the author and identifying the author's idiolect; intrinsic plagiarism detection (IPD) has been used for this purpose. Kohonen maps and singular value decomposition are used here to achieve this objective, and this paper discusses their application.
Keywords – Plagiarism, Intrinsic Plagiarism Detection, Kohonen Maps, Singular Value Decomposition, Authorship authentication, self-organizing maps
I Introduction
In general, plagiarism refers to copying another person's work without giving proper acknowledgement to the author. It covers not only copying the content of someone else's work but also copying their ideas. Authorship can be checked through the writing style of a particular person, and this can be used to detect plagiarism. Every person has a specific style of writing and uses certain words regularly in their work; this is called an idiolect. In several cases it is observed that data is copied or ideas are taken from various sources, which directly indicates that the author has not written the document alone. Two approaches are therefore useful for detecting plagiarism in digital content: first, analysing the writing style, and second, identifying portions of data taken from various sources. In addition, if more than two idiolects are found in a document, this is taken as an indication of plagiarism in the document.
The basic concept of stylometry is that every author exhibits a set of quantifiable characteristics, referred to as style markers. These style markers are unavoidable in a particular author's work. In stylometry, paragraphs, sentences, words and so on are analysed to identify the writing style of a particular person. The idiolects are established by subdividing the text into paragraphs and sentences and measuring the markers in each part. The main purpose of stylometric analysis is to observe the anomalies that appear regularly in the written work of a particular person; anomalies common to several pieces of a person's work help to determine whether a particular document was written by the same author or not. These style markers are a great help in detecting plagiarism.
In addition, it is often observed that students from a specific group or course frequently visit a particular set of websites to collect information for their assignments, and the plagiarism ratio is detected by comparing the content available on these sites with the content submitted by the student. Many widely available plagiarism detection tools follow this concept, but several problems are observed in their technique. First, a corpus (a set of texts) is required against which the document can be compared, and the selected corpus must relate to the subject of the document. Second, the detector has to find the related documents to compare with, which becomes more difficult when the plagiarizer uses content from hard copies of books and other offline sources. Third, not all plagiarism detectors are capable of detecting every level of plagiarism technique applied by the plagiarizer; moreover, different detectors use different approaches to identify plagiarism in a document.
In plagiarism detection two important aspects are considered: authorship attribution and authorship verification. In authorship attribution, several texts belonging to different known authors are given, and further given texts are classified to these authors; it makes use of the concepts of clustering. In authorship verification, it is determined whether a document was written by the claimed author or not. Here the techniques of intrinsic plagiarism detection are used.
II Previous work in the field of plagiarism detection
The main concept behind stylometry is detecting the writing-style pattern of a particular author, since each author has their own style of writing. Every author exhibits a set of quantifiable characteristics, referred to as style markers, which are unavoidable in that author's work. In stylometry, paragraphs, sentences, words and so on are analysed to identify the writing style of a particular person. These style markers help to detect plagiarism in a submitted document.
A. Using the style markers as features of the
document
According to Meyer zu Eissen and Stein, style markers fall into five categories: (i) character-level statistics, (ii) part of speech, (iii) counts of special words, (iv) sentence-level text statistics and (v) structural features such as the average word frequency class. Other parameters considered when characterising a document are its lexical level, vocabulary richness and readability level. Simple ratios are also used, namely average word length, average sentence length and syllables per sentence. Readability is measured in terms of the syllables used in a part of a sentence: the higher the complexity of the syllables, the lower the readability. Another factor that helps in deciding the authorship of a document is its vocabulary richness. Every individual has their own vocabulary, depending on their proficiency in the language, so this also helps to determine whether the document was written by the author or not. To measure vocabulary richness, word tags are used to compute vocabulary ratios, but this remains difficult to analyse: an approach suitable for examining one document may not be suitable for another, and the measure also varies with the size of the document.
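The simple ratios mentioned above can be illustrated with a minimal Python sketch. The function name `style_markers`, the naive sentence-splitting rule and the choice of a type-token ratio as the vocabulary-richness measure are assumptions made for illustration; the report does not prescribe a particular implementation.

```python
import re

def style_markers(text):
    """Return a few simple style markers for a piece of text (illustrative only)."""
    # Split into sentences on ., ! or ? (a deliberately naive rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters or apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return {"avg_word_length": 0.0,
                "avg_sentence_length": 0.0,
                "type_token_ratio": 0.0}
    return {
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words divided by total words (assumed richness measure).
        "type_token_ratio": len(set(words)) / len(words),
    }

sample = "The cat sat. The cat ran away!"
markers = style_markers(sample)
print(markers)
```

As the paragraph above notes, a ratio such as the type-token ratio depends on text length, so values of this kind are only comparable between passages of similar size.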
B. Approaches used for Analyzing the
plagiarism
According to various research reports, several approaches are used for analysing plagiarism. Among these, supervised learning is used for authorship verification. In supervised learning, a classifier assigns content to plagiarized and non-plagiarized classes, which helps to determine plagiarism in a document. Bayesian classifiers, naïve Bayes classifiers, Support Vector Machines (SVM) and multivariate classifiers are used in this approach. Several problems are encountered, however: a base document is always required for comparison in order to track plagiarism, and when data has been taken from several places, comparing a single document against many documents in the same subject area is a complex task.
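As a rough illustration of the supervised approach, the sketch below fits a Gaussian naïve Bayes classifier to labelled style vectors. The feature choice (average word length, average sentence length), the training values and the function names are hypothetical and invented for this example; they are not data or code from the report.

```python
import math

def fit_gaussian_nb(vectors, labels):
    """Estimate per-class prior, feature means and feature variances."""
    model = {}
    for label in set(labels):
        rows = [v for v, lab in zip(vectors, labels) if lab == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(vectors), means, variances)
    return model

def predict(model, vector):
    """Return the class label with the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, var in zip(vector, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical style vectors: (average word length, average sentence length).
train_x = [(4.1, 12.0), (4.3, 13.5), (4.0, 11.0),
           (5.6, 24.0), (5.9, 26.5), (5.4, 23.0)]
train_y = ["original", "original", "original",
           "plagiarized", "plagiarized", "plagiarized"]

model = fit_gaussian_nb(train_x, train_y)
label = predict(model, (5.5, 25.0))
print(label)
```

The sketch also makes the limitation noted above concrete: the classifier needs labelled base material for both classes before it can judge a new passage.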
Many researchers have also proposed unsupervised learning approaches for detecting plagiarism. In this approach the idiolects in the document, as discussed in the section above, are tracked to identify plagiarism. By subdividing the document into paragraphs and sentences, the idiolects can be used as the set of parameters for the analysis. The main purpose of stylometric analysis is to observe the anomalies that appear regularly in the written work of a particular person; the anomalies observed in a document help to determine whether it was written by the author or not. A document contains a varied set of idiolects, and the main task of unsupervised learning is to detect unknown idiolects as well, so that a stronger verification is possible.
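The idea of subdividing a document and looking for stylistic anomalies can be sketched without any reference corpus. The example below is a minimal illustration that assumes a single style marker (average word length) and a simple deviation threshold; both choices, the function names and the sample paragraphs are hypothetical, and a real system would combine many markers.

```python
import re
import statistics

def avg_word_length(chunk):
    """Average number of letters per word in a chunk of text."""
    words = re.findall(r"[A-Za-z']+", chunk)
    return sum(len(w) for w in words) / len(words) if words else 0.0

def flag_outlier_chunks(chunks, threshold=1.5):
    """Return indices of chunks whose marker deviates strongly from the document mean."""
    values = [avg_word_length(c) for c in chunks]
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # every chunk looks the same, nothing to flag
    return [i for i, v in enumerate(values)
            if abs(v - mean) / spread > threshold]

# Hypothetical document, already subdivided into paragraphs.
paragraphs = [
    "I like my dog and my cat a lot.",
    "We go to the park and run in the sun.",
    "My dog can sit and beg for a bit of ham.",
    "Notwithstanding epistemological considerations, heterogeneous methodologies predominate.",
    "The cat naps on the mat all day long.",
]
suspicious = flag_outlier_chunks(paragraphs)
print(suspicious)
```

A flagged paragraph is only a candidate for closer inspection; a change of style can have causes other than plagiarism.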
III Kohonen Maps used as self-organizing maps for detecting plagiarism
According to neuroscience, the brain self-organizes the views and concepts it collects and makes decisions accordingly. Kohonen maps try to apply the same idea. As each new input arrives, it is mapped onto the neurons according to its type. Clustering is used to support the self-organizing map (SOM) concept. A SOM maps the similarity relations of a high-dimensional space onto a simple geometric arrangement. In a Kohonen map, similar neurons are placed geometrically close together.
This architecture helps to identify unknown inputs easily. The mapping follows three steps: (i) initialization, (ii) training and (iii) visualization, which is used in classification and clustering. In the initialization step, each vector of the input space is considered n-dimensional, and each neuron is assigned a prototype vector of the same dimension; the prototype vectors are initialized randomly or linearly, as required. After training, these prototype vectors serve as examples for processing further inputs.
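The initialization step can be sketched as follows. This minimal example covers only the random variant and assumes that each prototype component is drawn uniformly from the range the input data spans in that dimension; the function name, the grid representation as a dictionary keyed by (row, column) and the sample data are hypothetical.

```python
import random

def init_prototypes(data, grid_size, seed=0):
    """Assign one n-dimensional prototype vector to each neuron of a square grid."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    dim = len(data[0])
    # Range spanned by the input data in each dimension.
    lows = [min(v[d] for v in data) for d in range(dim)]
    highs = [max(v[d] for v in data) for d in range(dim)]
    prototypes = {}
    for row in range(grid_size):
        for col in range(grid_size):
            # Random initialization: sample each component inside the data range.
            prototypes[(row, col)] = [rng.uniform(lo, hi)
                                      for lo, hi in zip(lows, highs)]
    return prototypes

# Hypothetical 2-dimensional input vectors.
data = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
prototypes = init_prototypes(data, grid_size=3)
print(len(prototypes))
```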
In the training process, let i be a neuron in the n x n grid, let $m_i$ be the prototype vector associated with i, and let $x \in \mathbb{R}^n$ be an arbitrary input vector. To map x to one of the neurons, the following distance is computed:

$$D_b = \min_i \lVert x - m_i \rVert$$

For better statistics, the following is used:

$$D_b = \max_i \lVert x - m_i \rVert$$

The neuron that satisfies the above condition is considered for further use and is denoted by b. The update rule for b and its topological neighbours is

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x - m_i(t)]$$

where t is the discrete time coordinate, $m_i(t+1)$ is the prototype vector at time t+1, and the neighborhood kernel $h_{bi}(t)$ is defined as

$$h_{bi}(t) = \exp\left(-\frac{\lVert r_b - r_i \rVert^2}{2\sigma^2(t)}\right)$$

Here $\alpha(t)$ is the learning rate, $r_b$ and $r_i$ are the grid positions of neurons b and i, and $\sigma(t)$ is the width of the kernel.
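One training step of this kind can be sketched in plain Python. The example assumes the minimum-distance criterion for selecting b, a Gaussian neighborhood kernel over grid positions, and fixed values for the learning rate and kernel width within the step; the function names, the 2 x 2 grid and all numeric values are hypothetical.

```python
import math

def best_matching_unit(prototypes, x):
    """Grid position b of the neuron whose prototype is closest to x."""
    return min(prototypes, key=lambda pos: math.dist(prototypes[pos], x))

def train_step(prototypes, x, alpha, sigma):
    """Move every prototype towards x, weighted by the neighborhood kernel."""
    b = best_matching_unit(prototypes, x)
    for pos, m in prototypes.items():
        # Squared distance between neuron i and neuron b on the grid.
        grid_dist_sq = (pos[0] - b[0]) ** 2 + (pos[1] - b[1]) ** 2
        # Gaussian kernel: 1 for b itself, smaller for distant neurons.
        h = math.exp(-grid_dist_sq / (2 * sigma ** 2))
        # m_i(t+1) = m_i(t) + alpha * h * (x - m_i(t))
        prototypes[pos] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return b

# Hypothetical 2 x 2 map with 2-dimensional prototype vectors.
prototypes = {
    (0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0],
}
x = [0.9, 0.1]
b = train_step(prototypes, x, alpha=0.5, sigma=1.0)
print(b, prototypes[b])
```

In a full training run the step is repeated over many inputs while the learning rate and kernel width are gradually reduced, so that the map first orders itself globally and then settles locally.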
Conclusion
Plagiarism refers to copying another person's work without giving proper acknowledgement to the author; it covers not only copying the content of someone else's work but also copying their ideas. Authorship can be checked through the writing style of a particular person, and this can be used to detect plagiarism. Every person has a specific style of writing and uses certain words regularly in their work; this is called an idiolect. In several cases it is observed that data is copied or ideas are taken from various sources, which directly indicates that the author has not written the document alone. The application of Kohonen maps and singular value decomposition helps considerably in plagiarism detection.
References
[1] J. Carroll, A Handbook for Deterring Plagiarism in Higher Education. Oxford: Oxford Center for Staff and Learning Development, 2002.
[2] M. Joy and M. Luck, "Plagiarism in Programming Assignments," IEEE Transactions on Education, vol. 42(2), pp. 129-133, 1999.
[3] B. Martin, "Plagiarism: a misplaced emphasis," Journal of Information Ethics, vol. 3(2), pp. 36-47, 1994.
[4] R. M. Coulthard, "Author identification, idiolect and linguistic uniqueness," Applied Linguistics; B. Dumas, "Reasonable doubt about reasonable doubt: assessing jury instruction adequacy in a capital case," in J. Cotterill (ed.), 2002, pp. 246-259.
[5] M. Koppel and J. Schler, "Authorship verification as a one-class classification problem," in 21st International Conference on Machine Learning, Banff, Canada, ACM Press, 2004.
[6] D. I. Holmes, "Authorship Attribution," Computers and the Humanities, vol. 28(2), pp. 87-106, 1994.
[7] S. Meyer zu Eissen and B. Stein, "Intrinsic plagiarism detection," in Lalmas et al. (Eds.): Advances in Information Retrieval, Proceedings of the 28th European Conference on IR Research (ECIR), vol. 3936 of Lecture Notes in Computer Science, London, 2006, Springer, pp. 565-569.
[8] K. Marco, "Using Style Markers for Detecting Plagiarism in Natural Language Documents," Department of Computer Science, University of Skövde, Skövde, Thesis, 2003.
[9] G. U. Yule, "On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to Two Cases of Disputed Authorship," Biometrika, vol. 30(3/4), pp. 363-390, 1939.
[10] W. Fucks, "On Mathematical Analysis of Style," Biometrika, vol. 39(1/2), pp. 122-129, 1952.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998. ISBN 0-585-02445-6.
[12] http://en.wikipedia.org/wiki/Temporal-Difference-learning, retrieved on 4-11-2013, 2:15 pm.
[13] Romans Lukashenko, Vita Graudina, Janis Grundspenkis, "Computer Based Plagiarism Detection Methods and Tools: An Overview," in International Conference on Computer Systems and Technologies (CompSysTech), 2007.
[14] Bao Jun-Peng, Shen Jun-Yi, Liu Xiao-Dong, Song Qin-Bao, "A Survey on Natural Language Text Copy Detection [J]," Journal of Software, vol. 14, no. 10, pp. 1753-1760 (Ch), 2003.
[15] Du Zou, Wei-Jiang Long, Zhang Ling, "A Two-Phase Plagiarism Detection Method," NSFC (National Natural Science Foundation of China, ID: 60603022), CNGI (China's Next Generation Internet, ID: 2008-122).
[16] Asim M. El Tahir Ali, Hussam M. Dahwa Abdulla, Vaclav Snasel, Ivo Vondrak, "Using Kohonen Maps and Singular Value Decomposition for Plagiarism Detection," in Third International Conference on Computational Intelligence, Communication Systems and Networks, IEEE, 2011.
[17] Chow Kok Kent, Naomie Salim, "Web based Cross Language Semantic Plagiarism Detection," in Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing, 2011.
[18] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, 1988.
[19] Wang Qiang, Zhan Zhongli, "Reinforcement Learning Model, Algorithms and Its Application," in International Conference on Mechatronic Science, Electric Engineering and Computer, 2011.
[20] B. Gipp and N. Meuschke, "Citation Pattern Matching Algorithms for Citation-Based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence," in Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 19-22 September 2011.
[21] G. P. Asuncion, The Plagiarism detection, John Wiley & Sons, 2011.