Cross Language Plagiarism Detection Tool - Desklib

Added on 2023-06-05

About This Document

This article discusses the development of a cross-language plagiarism detection tool. It covers the literature review, project management, requirement analysis and specification, and software design. The tool aims to detect plagiarism in documents written in different languages, using the NLTK library, the Gensim library, and the Google Search API. The article also discusses the semantic plagiarism technique and several state-of-the-art methods for detecting plagiarism. The software design is illustrated using JavaScript.

IT Project Management
Student Name: ******
Student ID: ******

Table of Contents
1 Introduction
2 Literature Review
3 Project Management
4 Requirement Analysis and Specification
5 Software Design
6 Implementation
7 Testing
8 Usability and User Design
9 Conclusion
References

1 Introduction
This application takes a text file as input from the user. The file may be written in either
English or Hindi, and users provide it as a Unicode file. To compare text across the two
languages, we create a “Hindi representation” of each English sentence. The application then
searches the internet for similar files and presents the user with the results that are
relevant to the uploaded text file.
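As a minimal sketch of the input step, the snippet below reads a Unicode text file and guesses whether it is English or Hindi by counting characters in the Devanagari Unicode block (U+0900 to U+097F). The function names `read_unicode_file` and `classify_script`, and the choice of UTF-8 as the encoding, are our assumptions for illustration and are not part of the final design.

```python
def read_unicode_file(path: str) -> str:
    """Read a text file as UTF-8 (assumed encoding); 'utf-8-sig' also drops a leading BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return handle.read()


def classify_script(text: str) -> str:
    """Return 'hindi', 'english', 'mixed' or 'unknown' from simple character counts."""
    devanagari = 0
    latin = 0
    for ch in text:
        if "\u0900" <= ch <= "\u097f":        # Devanagari block, used to write Hindi
            devanagari += 1
        elif ch.isascii() and ch.isalpha():   # basic Latin letters
            latin += 1
    if devanagari == 0 and latin == 0:
        return "unknown"
    if devanagari == 0:
        return "english"
    if latin == 0:
        return "hindi"
    return "mixed"
```

A file that mixes both scripts is reported as "mixed", so the caller can decide whether to split it by sentence before further processing.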
2 Literature Review
We reviewed many online articles related to the development of a cross-language
plagiarism tool. An article [1] on Stack Overflow suggested that we could develop this
application in Python using the NLTK and Gensim libraries, by building an LDA or LSA model of
the document. The key terms obtained this way can then be searched for using the Google
Search API. NLTK [2] is the Natural Language Toolkit for natural language processing. This
toolkit provides libraries for classification, tokenization, stemming, tagging, parsing,
semantic reasoning, etc.
In [5], Chow et al. describe the semantic plagiarism technique. Semantic plagiarism occurs
when a sentence is restructured or some of its terms are replaced with their synonyms. Both
cross-language and semantic plagiarism are hard to detect because the fingerprints of the
plagiarised text and the source text differ. Currently available plagiarism detection tools
are not capable of detecting such cases.
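To illustrate why fingerprints fail on reworded text, here is a minimal sketch of one common fingerprinting scheme: hashing every run of three consecutive words and comparing the resulting sets. This is our own illustration, not the method of [5]; the function names, the n-gram length and the example sentences are assumptions.

```python
import hashlib


def fingerprint(text: str, n: int = 3) -> set:
    """Hash every run of n consecutive words (a word n-gram, or 'shingle')."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles}


def overlap(a: set, b: set) -> float:
    """Jaccard overlap of two fingerprint sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "the quick brown fox jumps over the lazy dog"
verbatim = "the quick brown fox jumps over the lazy dog"
reworded = "the fast brown fox leaps over the idle dog"  # synonyms swapped in

print(overlap(fingerprint(original), fingerprint(verbatim)))
print(overlap(fingerprint(original), fingerprint(reworded)))
```

The verbatim copy shares every shingle with the original, while replacing three words with synonyms leaves no three-word run intact, so the overlap drops to zero even though the meaning is unchanged.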
Chow et al. [5] propose a new approach for detecting both cross-language and semantic
plagiarism. The query document is first shortened using a fuzzy swarm-based summarisation
approach, and the summary gives the most important keywords in the document. The summary
documents are translated into English using the Google Translate Application Programming
Interface (API), after which the words are stemmed and the stop words are removed. The
tokenized documents are sent to the Google AJAX Search API to search for similar documents
across the World Wide Web. The Stanford Parser and WordNet are used to determine the semantic
similarity between the suspected documents and the source documents. The Stanford Parser
assigns each term in a sentence to its corresponding role, such as noun, verb or adjective.
Each sentence is then represented in predicate form, and similarity is measured based on
those predicates using information from the WordNet
taxonomy. The testing dataset is built from two sets of input documents, which are produced
using different plagiarism techniques.
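The taxonomy-based similarity step can be sketched with a toy example. The snippet below uses a small hand-made hypernym tree as a stand-in for WordNet and scores two terms as 1 / (1 + length of the shortest path between them), which is one common path-based measure. The `PARENT` table, the function names and the scoring formula are our assumptions; [5] may use a different measure.

```python
# Hypothetical toy taxonomy: each term maps to its parent (hypernym).
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def ancestors(term: str) -> list:
    """Return the chain from the term up to the root, starting with the term itself."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a: str, b: str) -> float:
    """1 / (1 + number of edges on the path through the lowest shared ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + steps_a + chain_b.index(node))
    return 0.0  # no shared ancestor in this toy taxonomy
```

With this table, "dog" is closer to "wolf" than to "cat", and closer to "cat" than to "sparrow", which is the kind of ordering a synonym-aware comparison needs and a plain string match cannot give.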
Bird et al. [3] cover the scope of using the NLTK toolkit for natural language
processing. We are considering a methodology in which a Token class is used to represent a
unit of text such as a word, a sentence or a piece of a document. Kuhn et al. [4] describe
the application of semantic classification trees to natural language understanding. The paper
covers speech understanding, semantic classification, machine learning, natural language and
decision-tree-based capabilities for a translator application.
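A minimal sketch of what such a Token class could look like is given below. The field names (`text`, `kind`, `start`, `end`) and the helper `word_tokens` are hypothetical and chosen by us for illustration; they are not taken from [3] or from the NLTK API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A unit of text (word, sentence, ...) together with its position in the source."""
    text: str
    kind: str   # e.g. "word" or "sentence"; these labels are assumptions
    start: int  # offset of the first character in the source text
    end: int    # offset just past the last character


def word_tokens(source: str) -> list:
    """Split on whitespace, recording each word's character offsets."""
    tokens = []
    position = 0
    for word in source.split():
        start = source.index(word, position)  # search from the end of the previous word
        end = start + len(word)
        tokens.append(Token(word, "word", start, end))
        position = end
    return tokens
```

Keeping the offsets lets a later stage point back to the exact passage in the uploaded file when a match is reported.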
These papers discuss speech classification, machine-learning-based training of artificial
neural networks, decision trees, tokenization and several other methods.
In [6], Jeremy et al. discuss different state-of-the-art methods for detecting plagiarism.
The methods used in the experiment include Cross-Language Character N-Gram (CL-CnG),
Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS), Cross-Language Alignment-based
Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA) and
Translation + Monolingual Analysis (T+MA). According to the authors, each method behaves
consistently across different language pairs. There is a strong correlation not only across
languages but also across the text units that were considered. If a method is efficient on a
particular language pair, it will be similarly efficient on another language pair as long as
enough lexical resources are available for these languages. There was also a strong
correlation across types of text when the authors investigated the behaviour of the methods
on different types of text for a particular language pair. It was found that a method could
be optimized on one collection of texts and applied efficiently to another collection.
Finally, the authors concluded that methods behave differently in clustering matched and
mismatched units, even when they seem similar in performance.
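Of the methods listed, CL-CnG is the simplest to sketch: each text is reduced to a bag of character n-grams and two texts are compared by cosine similarity. The snippet below is a minimal illustration under our own assumptions (trigrams, raw counts, the function names and the example sentences); the exact weighting used in [6] may differ. Because this comparison relies on shared character sequences, it only works directly for languages written in the same script, so an English and Hindi pair would presumably need a transliteration step first.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and replacing non-alphanumerics with spaces."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


english = char_ngrams("international plagiarism detection")
spanish = char_ngrams("detección internacional de plagio")  # same topic, other language
unrelated = char_ngrams("el gato duerme en la cama")

print(cosine(english, spanish), cosine(english, unrelated))
```

The related English and Spanish phrases share trigrams such as "int" and "pla", so they score higher than the unrelated sentence, without any translation step.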
3 Project Management
The project activities are shown below (Barrón-Cedeño, Gupta and Rosso, 2013).
Developing a cross-language plagiarism detection tool
User management
Document management
Translation of input documents