Cross Language Plagiarism Detection Tool - Desklib

Added on 2023/06/05

AI Summary

This article discusses the development of a cross-language plagiarism detection tool. It covers project management, requirement analysis and specification, software design, and a literature review. The tool aims to detect plagiarism in documents written in different languages and uses the NLTK library, the Gensim library, and the Google Search API to do so. The article also discusses the semantic plagiarism technique and several state-of-the-art methods for detecting plagiarism. The implementation is written in JavaScript.

IT Project Management
Student Name: ******
Student ID: ******

Table of Contents
1 Introduction
2 Literature Review
3 Project Management
4 Requirement analysis and Specification
5 Software Design
6 Implementation
7 Testing
8 Usability and User Design
9 Conclusion
References

1 Introduction
This application takes a text file as input from the user. The file may be written in either
English or Hindi, and users provide it as a Unicode file. The application then searches the
internet for similar files and returns the results that are relevant to the uploaded text file.
To achieve this, we create a “Hindi representation” of the sentence in English.
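As a minimal sketch of how the input language could be recognised, the snippet below counts letters in the Devanagari Unicode block (U+0900 to U+097F) and in the basic Latin alphabet. The function name detectScript is hypothetical and is not part of the attached implementation. Devanagari is also used by languages other than Hindi, so this only identifies the script and treats it as a proxy for the language.

```javascript
// Hypothetical helper (not from the report): guess whether a Unicode string
// is written mainly in Devanagari (treated here as Hindi) or Latin (English).
function detectScript(text) {
  let devanagari = 0;
  let latin = 0;
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code >= 0x0900 && code <= 0x097f) {
      devanagari += 1; // Devanagari block
    } else if ((code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a)) {
      latin += 1; // A-Z or a-z
    }
  }
  if (devanagari === 0 && latin === 0) return "unknown";
  return devanagari >= latin ? "hindi" : "english";
}

console.log(detectScript("यह एक वाक्य है"));     // hindi
console.log(detectScript("This is a sentence")); // english
```

Digits, punctuation and whitespace are ignored, so a file that contains only such characters is reported as "unknown".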
2 Literature Review
We went through many articles on the internet related to the development of cross-language
plagiarism detection tools. An answer [1] on Stack Overflow suggested that we could develop
this application in Python using the NLTK and Gensim libraries, by building an LDA or LSA
model of the document. We can then use the Google Search API to search for the resulting
words. NLTK [2] is the Natural Language Toolkit for natural language processing. The toolkit
supports libraries for classification, tokenization, stemming, tagging, parsing, semantic
reasoning, etc.
In [5], Chow et al. discuss the semantic plagiarism technique. Semantic plagiarism is where
a sentence is reconstructed or some terms are replaced with their synonyms. Both cross-language
and semantic plagiarism are hard to detect because the fingerprints of the plagiarised text and
of its source differ. The plagiarism detection tools that are currently available are not
capable of detecting such cases.
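The effect of synonym replacement on a fingerprint can be illustrated with a small sketch. It assumes a simple fingerprint made of overlapping three-word shingles compared with the Jaccard coefficient; this is one common fingerprinting scheme and is not necessarily the one used by the tools discussed in [5]. The function names and example sentences are illustrative only.

```javascript
// Build a fingerprint as the set of overlapping k-word shingles.
function shingles(text, k = 3) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const set = new Set();
  for (let i = 0; i + k <= words.length; i += 1) {
    set.add(words.slice(i, i + k).join(" "));
  }
  return set;
}

// Jaccard coefficient: shared shingles divided by all distinct shingles.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const item of a) {
    if (b.has(item)) shared += 1;
  }
  return shared / (a.size + b.size - shared);
}

const original = "the quick brown fox jumps over the lazy dog";
const reworded = "the fast brown fox leaps over the idle dog";

console.log(jaccard(shingles(original), shingles(original))); // 1
console.log(jaccard(shingles(original), shingles(reworded))); // 0
```

Replacing only three words with synonyms leaves no three-word shingle in common, which is why a purely lexical fingerprint misses this kind of plagiarism.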
Chow et al. [5] propose a new approach for detecting both cross-language and semantic
plagiarism. The query document is first shortened using a fuzzy swarm-based summarisation
approach, so that the summary gives the most important keywords in the document. The
summaries are translated into English using the Google Translate Application Programming
Interface (API), after which the words are stemmed and the stop words are removed. The
tokenized documents are sent to the Google AJAX Search API to search for similar documents
throughout the World Wide Web. The Stanford Parser and WordNet are used to determine the
semantic similarity between the suspected documents and the source documents. The Stanford
Parser assigns each term in a sentence to its corresponding role, such as noun, verb or
adjective. Each sentence is then represented in a predicate form, and similarity is measured
based on those predicates using information from the WordNet taxonomy. The testing dataset
is built from two sets of input documents which are produced using different plagiarism
techniques.
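The preprocessing part of such a pipeline (tokenization, stop word removal and forming a search query) might be sketched as follows. The stop word list is a tiny illustrative sample, the function names are hypothetical, and no search API is called; the sketch only builds the query string that would be sent to a search service.

```javascript
// A small illustrative stop word list; a real list would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "the", "is", "are", "was", "were", "of", "in", "on", "to",
  "and", "or", "for", "with", "by", "that", "this", "it", "as", "at", "from",
]);

// Split English text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Drop tokens that carry little meaning on their own.
function removeStopWords(tokens) {
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

// Build a query from the most frequent remaining terms
// (ties are broken alphabetically so the result is deterministic).
function buildQuery(text, maxTerms = 5) {
  const counts = new Map();
  for (const token of removeStopWords(tokenize(text))) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, maxTerms)
    .map(([term]) => term)
    .join(" ");
}

console.log(removeStopWords(tokenize("The cat is on the mat"))); // [ 'cat', 'mat' ]
```

Selecting terms by raw frequency is a simplification; the approach in [5] obtains its keywords from a summarisation step instead.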
Bird et al. [3] cover the scope of using the NLTK toolkit for natural language
processing. We are considering a methodology in which a Token class is used to represent a
unit of text such as a word, a sentence or a piece of a document. Kuhn et al. [4] describe
the application of semantic classification trees to natural language understanding. Speech
understanding, semantic classification, machine learning, natural language and decision-tree-
based capabilities for a translator application are covered in this paper.
These papers discuss speech classification, machine learning with artificial neural
networks, decision trees, tokenization and several other methods.
In [6], Ferrero et al. discuss different state-of-the-art methods for detecting plagiarism.
Some of the methods used in the experiment are Cross-Language Character N-Gram (CL-CnG),
Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS), Cross-Language Alignment-based
Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA) and
Translation + Monolingual Analysis (T+MA). According to the authors, each method behaves
consistently across different language pairs. There is a strong correlation not only across
languages but also across the text units that were considered. If a method is efficient on a
particular language pair, it will be similarly efficient on another language pair as long as
enough lexical resources are available for these languages. There was also a strong
correlation across types of text when they investigated the behaviour of the methods on
different types of text for a particular language pair. It was found that a method could be
optimized on a particular collection of texts and applied efficiently to another collection.
Finally, it was concluded that methods behave differently in clustering matched and
mismatched units, even if they seem similar in performance.
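Of the methods listed above, CL-CnG is the simplest to illustrate. The sketch below assumes the usual formulation: texts are lowercased, stripped of diacritics and punctuation, split into character 3-grams, and compared with cosine similarity. The function names are illustrative. Because it relies on shared character sequences, this version only works for language pairs written in the Latin script; Hindi text in Devanagari would first need to be transliterated or translated.

```javascript
// Count character n-grams after normalising the text
// (lowercase, no diacritics, no punctuation, single spaces).
function charNgrams(text, n = 3) {
  const cleaned = text
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining diacritics
    .replace(/[^a-z0-9 ]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  const counts = new Map();
  for (let i = 0; i + n <= cleaned.length; i += 1) {
    const gram = cleaned.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two n-gram count vectors.
function cosine(a, b) {
  let dot = 0;
  for (const [gram, count] of a) {
    dot += count * (b.get(gram) || 0);
  }
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, v) => sum + v * v, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// An English phrase against its French equivalent and against unrelated text.
const english = charNgrams("international organisation");
console.log(cosine(english, charNgrams("organisation internationale")));
console.log(cosine(english, charNgrams("the cat sat on the mat")));
```

The first score is higher than the second because the English and French phrases share most of their character 3-grams.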
3 Project Management
The Project Activities are shown below (Barrón-Cedeño, Gupta and Rosso, 2013).
Developing a cross-language plagiarism detection tool
  User management
  Document management
  Translation of input documents
    Translate the plagiarized Hindi documents into English
    Improve the effectiveness of the detection process
    Use Google Translate API
  Removing Stop Words
    Before passing the translated documents for comparison through the Internet
    Remove the stop words in the translated text
  Stemming Words
    Remove the affixes
    Generate root word
    Pattern matching
    Text Stemmer and Porter Stemmer
    Use of Porter Stemming algorithm
    Removing the commoner morphological and inflexional endings from words in English
  Identifying Similar Documents
    Collection of documents that are located around the World Wide Web
    Enables small and characteristic fragments translation
    Query documents or texts are inserted
    Use of Google AJAX Search API
  Comparison of Similar Pattern
    Detect plagiarism
    Represent the sentence uniquely.
  Summary of the Result
    Gathering the result
    Plagiarism detection is displayed
    Highlight the similarities between the two files.
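To give a feel for the stemming activity, the sketch below strips a few common English endings. It is a deliberately naive suffix stripper and not the Porter Stemming algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list and the function name naiveStem are assumptions made for illustration.

```javascript
// Illustrative suffix list, checked in this order (longer endings first).
const SUFFIXES = ["ingly", "edly", "ing", "ed", "ly", "es", "s"];

// Remove the first matching suffix, provided at least three letters remain.
function naiveStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, lower.length - suffix.length);
    }
  }
  return lower;
}

console.log(naiveStem("detected"));  // detect
console.log(naiveStem("matching"));  // match
console.log(naiveStem("documents")); // document
```

Unlike the Porter algorithm, this sketch does not repair the stem after stripping, so "running" becomes "runn" rather than "run".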
Resources are shown below.
Resource Name | Type | Initials | Max. Units | Std. Rate | Accrue At | Base Calendar
Project Manager | Work | P | 100% | $1,000.00/hr | Prorated | Standard
System Analyst | Work | S | 100% | $1,000.00/hr | Prorated | Standard
Developer | Work | D | 100% | $1,000.00/hr | Prorated | Standard
Designer | Work | D | 100% | $1,000.00/hr | Prorated | Standard
Technical Writer | Work | T | 100% | $1,000.00/hr | Prorated | Standard
Code Designer | Work | C | 100% | $1,000.00/hr | Prorated | Standard
Overall Project Activities are shown below (Chauhan, Arora and Singhal, 2017).
Task Name | Duration | Start | Finish | Predecessors | Resource Names
Developing a cross-language plagiarism detection tool | 60 days | Wed 9/12/18 | Tue 12/4/18 | |
  User management | 1 day | Wed 9/12/18 | Wed 9/12/18 | | Designer, Developer
  Document management | 2 days | Thu 9/13/18 | Fri 9/14/18 | 2 | Designer, Project Manager, Technical Writer
  Translation of input documents | 8 days | Mon 9/17/18 | Wed 9/26/18 | 3 |
    Translate the plagiarized Hindi documents into English | 2 days | Mon 9/17/18 | Tue 9/18/18 | | Code Designer, Developer, System Analyst
    Improve the effectiveness of the detection process | 3 days | Wed 9/19/18 | Fri 9/21/18 | 5 | Developer
    Use Google Translate API | 3 days | Mon 9/24/18 | Wed 9/26/18 | 6 | Code Designer, Designer
  Removing Stop Words | 5 days | Thu 9/27/18 | Wed 10/3/18 | 4 |
    Before passing the translated documents for comparison through the Internet | 2 days | Thu 9/27/18 | Fri 9/28/18 | | Designer, System Analyst
    Remove the stop words in the translated text | 3 days | Mon 10/1/18 | Wed 10/3/18 | 9 | Developer, Code Designer
  Stemming Words | 15 days | Thu 10/4/18 | Wed 10/24/18 | 8 |
    Remove the affixes | 3 days | Thu 10/4/18 | Mon 10/8/18 | | Designer
    Generate root word | 4 days | Tue 10/9/18 | Fri 10/12/18 | 12 | System Analyst
    Pattern matching | 2 days | Mon 10/15/18 | Tue 10/16/18 | 13 | Designer
    Text Stemmer and Porter Stemmer | 2 days | Wed 10/17/18 | Thu 10/18/18 | 14 | Developer
    Use of Porter Stemming algorithm | 2 days | Fri 10/19/18 | Mon 10/22/18 | 15 | Developer
    Removing the commoner morphological and inflexional endings from words in English | 2 days | Tue 10/23/18 | Wed 10/24/18 | 16 | Developer, System Analyst
  Identifying Similar Documents | 10 days | Thu 10/25/18 | Wed 11/7/18 | 11 |
    Collection of documents that are located around the World Wide Web | 2 days | Thu 10/25/18 | Fri 10/26/18 | | System Analyst
    Enables small and characteristic fragments translation | 3 days | Thu 10/25/18 | Mon 10/29/18 | | Developer
    Query documents or texts are inserted | 3 days | Tue 10/30/18 | Thu 11/1/18 | 20 | System Analyst, Technical Writer
    Use of Google AJAX Search API | 4 days | Fri 11/2/18 | Wed 11/7/18 | 21 | Code Designer, Developer
  Comparison of Similar Pattern | 10 days | Thu 11/8/18 | Wed 11/21/18 | 18 |
    Detect plagiarism | 4 days | Tue 11/13/18 | Fri 11/16/18 | 24 | Code Designer, Project Manager, System Analyst
    Represent the sentence uniquely. | 3 days | Mon 11/19/18 | Wed 11/21/18 | 25 | System Analyst, Technical Writer
  Summary of the Result | 9 days | Thu 11/22/18 | Tue 12/4/18 | 23 |
    Gathering the result | 2 days | Thu 11/22/18 | Fri 11/23/18 | | Project Manager, System Analyst
    Plagiarism detection is displayed | 3 days | Mon 11/26/18 | Wed 11/28/18 | 28 | Designer, Developer, Project Manager
    Highlight the similarities between the two files. | 4 days | Thu 11/29/18 | Tue 12/4/18 | 29 | Code Designer, Developer
Project charter is shown below.
Resource Cost status is shown below (Ehsan and Shakery, 2016).
[Figure: bar chart of Actual Cost, Remaining Cost and Baseline Cost for each resource (Project Manager, System Analyst, Developer, Designer, Technical Writer, Code Designer); cost axis from $0.00 to $300,000.00]
Project Activities Cost is shown below.
[Figure: bar chart of Actual Cost, Remaining Cost and Baseline Cost for the task "Developing a cross-language plagiarism detection tool"; cost axis from $0.00 to $1,000,000.00]
Name | Fixed Cost | Actual Cost | Remaining Cost | Cost | Baseline Cost | Cost Variance
Developing a cross-language plagiarism detection tool | $0.00 | $0.00 | $912,000.00 | $912,000.00 | $0.00 | $912,000.00
MS Project file is attached here.
4 Requirement analysis and Specification
Plagiarism is becoming a serious problem for the academic community. Detecting
plagiarism at different levels is an important issue. The complexity of the problem increases
when we try to detect plagiarism in source code that may be written in the same language or
may have been converted into a different language (Franco-Salvador et al., 2016). This kind
of plagiarism is found not only in academic work but also in industries that deal with
software design. The main problem with source code plagiarism is that different programming
languages may have different syntactic structures.
Based on the language homogeneity or heterogeneity of the texts being compared,
plagiarism detection can be classified into monolingual and cross-lingual. The cross-language
plagiarism detection process is similar to the external plagiarism detection task, with a few
modifications in the heuristic retrieval and detailed analysis stages (Gelbukh, 2009). The
cross-language heuristic retrieval stage aims to retrieve the collection of candidate source
documents from the data set. Translating the input document from the query language into the
source language may be required in this stage. The cross-language detailed analysis stage
measures the cross-language similarity between sections of the suspicious document and
sections of the candidate documents retrieved in the previous stage (Kashkur, Parshutin and
Borisov, 2010).
Language used: JavaScript.
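The two stages described in this section, heuristic retrieval followed by detailed analysis, could be sketched as below. The sketch assumes that the suspicious document has already been translated into the language of the candidate documents, uses plain word overlap as the similarity measure, and picks the thresholds arbitrarily. All function names and the sample data are hypothetical.

```javascript
// Set of lowercase words in a text.
function termSet(text) {
  return new Set(text.toLowerCase().match(/[a-z]+/g) || []);
}

// Overlap coefficient: shared terms divided by the size of the smaller set.
function overlap(a, b) {
  if (a.size === 0 || b.size === 0) return 0;
  let shared = 0;
  for (const term of a) {
    if (b.has(term)) shared += 1;
  }
  return shared / Math.min(a.size, b.size);
}

// Stage 1, heuristic retrieval: keep documents whose vocabulary
// overlaps enough with the suspicious document.
function retrieveCandidates(suspicious, corpus, threshold = 0.3) {
  const terms = termSet(suspicious);
  return corpus.filter((doc) => overlap(terms, termSet(doc.text)) >= threshold);
}

// Stage 2, detailed analysis: compare the documents sentence by sentence
// and report the pairs that are similar enough.
function detailedAnalysis(suspicious, candidate, threshold = 0.6) {
  const split = (t) => t.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean);
  const matches = [];
  for (const s of split(suspicious)) {
    for (const c of split(candidate.text)) {
      const score = overlap(termSet(s), termSet(c));
      if (score >= threshold) matches.push({ suspicious: s, source: c, score });
    }
  }
  return matches;
}

const corpus = [
  { id: "a", text: "Plagiarism detection compares documents. It flags copied sentences." },
  { id: "b", text: "Bananas grow in tropical climates." },
];
const suspicious = "Our tool flags copied sentences. It was built in a week.";

for (const candidate of retrieveCandidates(suspicious, corpus)) {
  console.log(candidate.id, detailedAnalysis(suspicious, candidate));
}
```

In this example only document "a" survives the retrieval stage, and the detailed analysis pairs "Our tool flags copied sentences" with "It flags copied sentences" at a score of 0.75.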
5 Software Design
The software design for the cross-language plagiarism detection tool is illustrated
below (Kasprowicz and Wada, 2014).
6 Implementation
The implementation code is attached here (Lee, 2012).
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<script type="text/javascript" src="https://www.google.com/jsapi">
</script>
</head>
<body>
Type in Hindi (Press Ctrl+g to toggle between English and Hindi)<br>
<textarea id="transliterateTextarea" style="width:600px;height:200px"></textarea>
</body>
7 Testing
Testing showed that the tool changes the typed text from English to Hindi.
8 Usability and User Design
Any text that the user types is changed from English to Hindi (Potthast et al., 2010).
9 Conclusion
Once the project is complete, we hope that, given an input file, our application will be able
to find plagiarised content related to it in articles on the web.
References
[1] ‘How to develop a plagiarism detector?’ stackoverflow.com/questions/1193408, accessed on
11 August 2018.
[2] NLTK 3.3 documentation for the Natural Language Toolkit, nltk.org, accessed on 11 August
2018.
[3] Steven Bird, Edward Loper. NLTK: The Natural Language Toolkit.
[4] Roland Kuhn, Renato De Mori. The Application of Semantic Classification Trees to Natural
Language Understanding.
[5] Chow Kok Kent, Naomie Salim. Web Based Cross Language Semantic Plagiarism Detection,
03 January, 2012.
[6] Jeremy Ferrero, Laurent Besacier, Didier Schwab, Frederic Agnes. Deep Investigation of
Cross-Language Plagiarism Detection Methods.
Barrón-Cedeño, A., Gupta, P. and Rosso, P. (2013). Methods for cross-language plagiarism
detection. Knowledge-Based Systems, 50, pp.211-217.
Chauhan, S., Arora, A. and Singhal, Y. (2017). Plagiarism Detection of C Program using
Assembly Language. International Journal of Computer Applications, 158(3), pp.17-22.
Ehsan, N. and Shakery, A. (2016). Candidate document retrieval for cross-lingual plagiarism
detection using two-level proximity information. Information Processing &
Management, 52(6), pp.1004-1017.
Franco-Salvador, M., Gupta, P., Rosso, P. and Banchs, R. (2016). Cross-language plagiarism
detection over continuous-space- and knowledge graph-based representations of
language. Knowledge-Based Systems, 111, pp.87-99.
Gelbukh, A. (2009). Computational Linguistics and Intelligent Text Processing. Heidelberg:
Springer.
Kashkur, M., Parshutin, S. and Borisov, A. (2010). Research into Plagiarism Cases and
Plagiarism Detection Methods. Scientific Journal of Riga Technical University.
Computer Sciences, 42(1).
Kasprowicz, D. and Wada, H. (2014). Methods for automated detection of plagiarism in
integrated-circuit layouts. Microelectronics Journal, 45(9), pp.1212-1219.
Lee, Y. (2012). Plagiarism Detection among Source Codes using Adaptive Methods. KSII
Transactions on Internet and Information Systems.
Methods for Intrinsic Plagiarism Detection. (2017). Informatics and Applications.
Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P. (2010). Cross-language plagiarism
detection. Language Resources and Evaluation, 45(1), pp.45-62.