Comprehensive Report: XML Information Retrieval System and Models

Added on 2023/01/16

AI Summary

This report provides an overview of XML information retrieval systems, focusing on the processes involved in capturing, storing, and retrieving data from XML databases. It introduces the concept of databases and highlights the importance of efficient retrieval methods. The report delves into various information retrieval models, including the Boolean, vector space, and probabilistic models, explaining their functionalities and applications. It also discusses the significance of preprocessing steps such as stopword removal, stemming, and indexing to improve search accuracy. Furthermore, the report explores different XML storage techniques, such as text approach, relational DTD approach, and others, providing insights into their advantages and disadvantages. The document emphasizes the role of XML in data representation and exchange, particularly in the context of web services and data-centric applications. The report aims to provide a comprehensive understanding of the challenges and solutions related to XML information retrieval.

Running head: INFORMATION RETRIEVAL IN XML BASED DATABASE 1
XML INFORMATION RETRIEVAL SYSTEM
Student’s name
Institutional
Date

Retrieval system 2
Abstract
The purpose of this paper is to report on the development of Extensible Markup Language
(XML) retrieval systems, which has kept pace with the extensive growth of XML information
repositories. XML has emerged as the dominant standard for representing data and for
exchanging it over the global network.

Introduction
Information retrieval in XML is concerned with all the processes used to capture, present,
store, organize, and retrieve information. A database is a collection of related data kept in an
organized manner for easy retrieval (Drakopoulos & Kanavos, 2016). The retrieval process
consists of several stages, running from the representation of the data to the display of results
to the user. The intermediate steps include searching/matching, ranking, and filtering. For a
search operation to be considered successful, most of the results it returns must be relevant
(Jukic, Vrbsky, & Nestorov, 2016).
As a markup language, XML defines the rules used to encode documents in a format that is
both human-readable and machine-readable (Lu, Liu, & Wu, 2015). Its adoption has created the
need to manage large collections of XML documents and to retrieve the relevant information
from them efficiently. The development of XML alongside web services has been of great help
because it provides interoperability between distributed technologies (Ma, Bao, Bao, Yuan,
Huang, & Zhao, 2017). A drawback, however, is that a component that is bound to another
cannot be used separately with a different component. To overcome this problem, an XML-based
model is used to reduce the maintenance overhead (Agarwal & Ramamritham, 2015).
This paper describes information retrieval models. It explains the retrieval system and the
models used to find relevant information in an XML database.

INFORMATION RETRIEVAL (IR) MODELS
In the information retrieval context, XML is emphasized mainly as a tool for encoding text
or documents (Monisha & Vigneshwari, 2015). An arguably more widespread use of XML is
encoding non-text data. For instance, a user may need to export data in XML format from a
resource planning system and read it into another program to produce analytical graphs for a
presentation (Munir & Anjum, 2018). This category of XML application is termed data-centric,
because numerical and other non-text data dominate while text makes up only a small portion of
the total. Data-centric XML is stored in a database (Jukic, Vrbsky, & Nestorov, 2016).
Here, an information retrieval (IR) system is developed, fully or partially, for XML
documents. The XML documents are preprocessed and structural terms are generated for every
document (Yalamanchi & Perry, 2017). In response to a user query, a ranking method orders the
documents, and the ranked results are presented as output...
Furthermore, to improve searching and navigation within documents, stopword removal is
performed to avoid unnecessary memory allocation.
An IR model specifies how documents are represented, how queries are represented, and how
the retrieval operation works. The fundamental classes of IR models are:
• Boolean model
• Vector space model
• Probabilistic model

a) Boolean Model
In this model, each document is associated with a set of keywords. Retrieval under the
Boolean model makes a binary decision: a document is either relevant or irrelevant to the query
(Munir & Anjum, 2018). Because the model gives no degree of relevance, and because content on
the web is highly redundant, it is very difficult to prioritize the documents to display when this
model is used (Agarwal & Ramamritham, 2015).
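The binary nature of Boolean retrieval can be illustrated with a minimal sketch. The tiny document collection and the helper names (build_inverted_index, boolean_and, boolean_or, boolean_not) are assumptions made for illustration and do not come from the cited sources.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Ids of documents that contain every one of the given terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Ids of documents that contain at least one of the given terms."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, all_ids, term):
    """Ids of documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical three-document collection.
docs = {
    1: "xml retrieval system",
    2: "relational database storage",
    3: "xml database storage",
}
index = build_inverted_index(docs)

xml_and_storage = boolean_and(index, "xml", "storage")
xml_not_database = boolean_and(index, "xml") & boolean_not(index, docs, "database")
```

Every result is an unordered set: a document either matches or it does not, so the model offers nothing with which to order the matches.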
Probabilistic Model
This class of IR models assumes that, for each user query, there exists a set of
documents that constitutes the ideal response (Scheevel, Ozor, Hilton, & Collins, 2016).
The user inspects an initial set of retrieved documents, and an interactive interface
gives immediate feedback that is used to move the results toward that ideal answer set.
The model is based on the probability ranking principle, which states that an IR
system should rank documents by their probability of relevance to the user's query
(Jukic, Vrbsky, & Nestorov, 2016)...
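The text does not name a specific weighting formula. As one possible illustration, the sketch below assumes the Robertson/Sparck Jones relevance weight with 0.5 smoothing, estimated from documents the user has marked as relevant. The collection, the query, and the function names are hypothetical.

```python
import math

def rsj_weight(N, n, R, r):
    """Robertson/Sparck Jones term weight with 0.5 smoothing.

    N: documents in the collection, n: documents containing the term,
    R: documents judged relevant, r: relevant documents containing the term.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def rank(docs, query_terms, relevant_ids):
    """Order document ids by the summed weights of the query terms they contain."""
    N, R = len(docs), len(relevant_ids)
    weights = {}
    for term in query_terms:
        n = sum(1 for words in docs.values() if term in words)
        r = sum(1 for d in relevant_ids if term in docs[d])
        weights[term] = rsj_weight(N, n, R, r)
    scores = {d: sum(w for t, w in weights.items() if t in words)
              for d, words in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical collection: document id -> set of index terms.
docs = {
    1: {"xml", "retrieval"},
    2: {"xml", "storage"},
    3: {"sql", "storage"},
    4: {"xml", "retrieval", "index"},
}
# The user has marked document 1 as relevant to the query.
ranking = rank(docs, ["xml", "retrieval"], relevant_ids={1})
```

Documents that share terms with the judged-relevant document rise to the top, which is the feedback loop the paragraph describes.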
b) Vector Space Model
The vector space model is the best-known IR model. Documents are ranked by the
degree of similarity between each document vector and the query vector. XML retrieval
must also take into account the structural context in which each term occurs
(Yalamanchi & Perry, 2017). Therefore, the vector space model is used to represent such
structural context.
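As a rough sketch of this idea, each dimension of the vector can be a structural term, taken here to be a (path, word) pair, with cosine similarity as the measure. The pair representation, the raw-count weighting, and the sample vectors are assumptions for illustration.

```python
import math
from collections import Counter

def to_vector(structural_terms):
    """Count how often each (path, word) structural term occurs."""
    return Counter(structural_terms)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical documents: the word "xml" occurs in both, but in different contexts.
doc_a = to_vector([("book/title", "xml"), ("book/title", "retrieval"),
                   ("book/author", "smith")])
doc_b = to_vector([("book/author", "xml"), ("book/title", "databases")])
query = to_vector([("book/title", "xml")])

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

Only doc_a scores above zero, because the query asks for "xml" inside a title and doc_b contains the word only in the author context.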
XML documents comprise both the data and the structure that binds it together, arranged in
such a way that the data is readable by both machines and people. An XML document can be
transferred electronically from one party to another and carries all the information needed to
interpret it; it is therefore self-describing (Ling, Zeng, Le, & Lee, 2016). This feature is why
XML has gained so much value in Service-Oriented Architectures (SOA) and web services
(Agarwal & Ramamritham, 2015).
This work is subdivided into the following steps:
• Step 1: Stopword removal
• Step 2: Stemming
• Step 3: Indexing
• Step 4: Ranking
• Step 5: Query evaluation
i. Stop words
Stop words act as noise that interferes with quickly ascertaining the meaning, relevance,
and importance of the words found in a given document (Jukic, Vrbsky, & Nestorov, 2016). For
the content to be clear, such words should be filtered out. Filtering stop words reduces the size
of the index and helps search queries return better results (Celesti, Fazio, Romano, & Villari,
2016).
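A minimal sketch of this filtering step follows. The stop-word list is a short illustrative sample, not a list taken from the cited sources, and whitespace tokenization is assumed for simplicity.

```python
# Illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

tokens = remove_stop_words("The retrieval of information in an XML database")
```

Four of the eight words are dropped before indexing, which is where the reduction in index size comes from.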
ii. Stemming
Stemming is the practice of reducing inflected (or occasionally derived) words to their
stem, root, or base form. Two stemming algorithms are considered here: the Lovins algorithm
and Porter's algorithm.
The Lovins algorithm specifies a set of suffix patterns and follows a cyclic heuristic
methodology. Its design was strongly influenced by the technical vocabulary that Lovins came
across in her work (Monisha & Vigneshwari, 2015).
The Lovins algorithm is noticeably larger than the well-known Porter algorithm because of
its very extensive list of endings, but it has the advantage of being faster. In effect, it trades
space for time through its larger set of suffixes (Celesti, Fazio, Romano, & Villari, 2016).
Porter's algorithm, on the other hand, is simpler than the Lovins algorithm because it has
fewer rules: only 60, organized into sets. Conflicts among the rules within a set are resolved
before the next set is applied. The rule sets are separated into five distinct phases, and each
phase removes a particular kind of word suffix.
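The phased structure can be sketched with a toy suffix stripper. This is not Porter's or Lovins's actual rule set: the three phases, the handful of rules, and the minimum stem length are hypothetical, chosen only to show rules grouped into sets with at most one rule firing per set.

```python
# Toy rule sets, applied in order. Within a set, the first matching rule wins.
PHASES = [
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],  # plural endings
    [("ing", ""), ("ed", "")],                                # verb endings
    [("ational", "ate"), ("ization", "ize")],                 # derivational endings
]

def stem(word, min_stem=3):
    """Strip at most one suffix per phase, keeping at least min_stem letters."""
    for rules in PHASES:
        for suffix, replacement in rules:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: len(word) - len(suffix)] + replacement
                break  # conflicts inside a set are settled before the next set runs
    return word

stems = [stem(w) for w in ["documents", "ranking", "indexed", "class", "relational"]]
```

The rule that maps "ss" to itself protects words such as "class" from the plain "s" rule listed after it, a small example of resolving a conflict within one set.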
iii. Indexing
This step estimates the distribution of all terms, both tags and literals, in terms of their
frequency within the document. Any duplicate path or term is removed so that the resulting XML
summary tree contains each path at most once. Consequently, the summary tree is smaller than
the original document and contains only the tags and terms that matter for indexing (Celesti,
Fazio, Romano, & Villari, 2016). The word distribution can additionally be used to generate a
weight for every term of the summary tree so as to aid the process of ranking.
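One way to picture the summary tree is as the list of distinct root-to-element paths, together with a frequency count for each (path, word) pair. The sketch below uses Python's standard xml.etree.ElementTree parser. The sample document and the summarize function are illustrative assumptions, and text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def summarize(xml_text):
    """Return (distinct paths in first-seen order, Counter of (path, word) pairs)."""
    root = ET.fromstring(xml_text)
    paths = []
    freq = Counter()

    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        if path not in paths:          # each path is kept only once
            paths.append(path)
        for word in (elem.text or "").lower().split():
            freq[(path, word)] += 1    # term distribution, usable later for weights
        for child in elem:
            visit(child, path)

    visit(root, "")
    return paths, freq

# Hypothetical document with a repeated <book>/<title> structure.
sample = ("<library><book><title>XML retrieval</title></book>"
          "<book><title>XML storage</title></book></library>")
paths, freq = summarize(sample)
```

The document holds five elements but the summary keeps three paths, and the count of 2 for "xml" under the title path is the kind of figure from which a term weight could be derived.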
iv. Ranking
This step loads the summary tree into the index structure. It involves separating the
content data (raw text, which is stored in an inverted file) from the path data (structural text,
which is stored in the path index). A path index is a hierarchy of tags that keeps a record of
each path in the given collection (Lu, Liu, & Wu, 2015).
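The separation of content data from path data can be sketched as two structures built in one pass: a path index that assigns an id to every distinct path, and an inverted file that maps each word to the documents and path ids where it occurs. The layout of both structures and the sample collection are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(collection):
    """Build (inverted_file, path_index) for a {doc_id: xml_text} collection."""
    inverted_file = defaultdict(set)   # word -> {(doc_id, path_id)}
    path_index = {}                    # path string -> path_id
    for doc_id, xml_text in collection.items():
        stack = [(ET.fromstring(xml_text), "")]
        while stack:
            elem, prefix = stack.pop()
            path = f"{prefix}/{elem.tag}"
            # Reuse the id of a known path, or assign the next free id.
            path_id = path_index.setdefault(path, len(path_index))
            for word in (elem.text or "").lower().split():
                inverted_file[word].add((doc_id, path_id))
            stack.extend((child, path) for child in elem)
    return inverted_file, path_index

# Hypothetical two-document collection.
collection = {
    "d1": "<book><title>XML retrieval</title></book>",
    "d2": "<book><title>XML storage</title><author>Lee</author></book>",
}
inverted_file, path_index = build_indexes(collection)
title_id = path_index["/book/title"]
```

A lookup for a word in a given context then becomes a lookup in the inverted file followed by a filter on the path id.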
v. Query Evaluation
After preprocessing, the structural terms of the query are generated. These structural
terms are then matched against those generated from the XML documents in order to evaluate
context resemblance (Cornish, 2017).
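The text does not give a formula for context resemblance. One common definition, assumed here, scores a query path against a document path as (1 + |query path|) / (1 + |document path|) when the query path can be turned into the document path by inserting extra nodes, and 0 otherwise. The function names and sample paths are illustrative.

```python
def is_subsequence(query_path, doc_path):
    """True if query_path's tags appear in doc_path in the same order."""
    remaining = iter(doc_path)
    # "tag in remaining" consumes the iterator up to the match, so order is respected.
    return all(tag in remaining for tag in query_path)

def context_resemblance(query_path, doc_path):
    """Score in [0, 1]; equals 1.0 only when the two paths are identical."""
    if is_subsequence(query_path, doc_path):
        return (1 + len(query_path)) / (1 + len(doc_path))
    return 0.0

exact = context_resemblance(["book", "title"], ["book", "title"])
partial = context_resemblance(["book", "title"], ["book", "chapter", "title"])
mismatch = context_resemblance(["title", "book"], ["book", "title"])
```

Under this definition a title nested inside a chapter still matches a query for a book title, but with a lower score than a direct child would receive.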
XML STORAGE TECHNIQUES
Depending on its contents, an XML document may be modeled as a tree or as a graph. It is
modeled as a tree when the document has no internal or global links; when the document
contains internal or global links, it is modeled as a graph (Lu, Liu, & Wu, 2015).
In a tree structure, the nodes represent XML elements and attributes, and the edges
represent parent-child relationships. Boxes with rounded corners denote attribute and/or text
nodes. The XML storage techniques include the following: Text Approach, Relational DTD
Approach, Edge Approach, Object Approach and Native XML Storage Approach (Cornish, 2017).
a. Text Approach
The first approach stores every XML document as a text file. One way of
implementing a query engine with the text approach is to parse the XML file into an
in-memory tree against which the query is executed. The tree is kept in memory for as
long as any of its nodes is required to evaluate the query.
To make this approach competitive, an indexing scheme based on element offsets
has been adopted: the offset of each parent tag is mapped to the offsets of its children,
and a path index is built on top of these XML element offsets.
The main disadvantage of this approach is that every update to an XML document
also changes the element offsets. The indices built on those offsets are thereby
invalidated and must be rebuilt (Cornish, 2017).
Concerning concurrency control, the XML document and its corresponding indices
must be locked while threads are accessing the data. If a thread is reading, other threads
may read as well; however, while a thread is performing an update, no other thread may
read or update. A difficulty arises when new threads keep gaining read access to the
document, because they can keep any thread from updating a section of that document
(Drakopoulos & Kanavos, 2016).
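The locking rule described above (many concurrent readers, one exclusive writer) can be sketched with a small readers-writer lock built on threading.Condition. The class name, the shared_doc dictionary, and the reader and writer functions are hypothetical. This simple version gives writers no priority, so it exhibits the same difficulty when readers keep arriving.

```python
import threading

class ReadWriteLock:
    """Many readers may hold the lock together; a writer holds it alone."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:              # readers wait only for a writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()       # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()

lock = ReadWriteLock()
shared_doc = {"xml": "<a/>"}   # stands in for the document and its indices
results = []

def reader():
    lock.acquire_read()
    try:
        results.append(shared_doc["xml"])
    finally:
        lock.release_read()

def writer():
    lock.acquire_write()
    try:
        shared_doc["xml"] = "<a><b/></a>"
    finally:
        lock.release_write()

threads = [threading.Thread(target=reader) for _ in range(3)]
threads.append(threading.Thread(target=writer))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A production design would usually add writer preference or fair queueing so that a pending update cannot be postponed indefinitely.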
b. The Relational DTD Approach
The second strategy, known as the shared-inlining method, requires the existence of
a Document Type Definition (DTD). An element declaration in a DTD begins with the
case-sensitive keyword <!ELEMENT, followed by the element's name and then by a
content specification. The content specification may be, for example, the keyword ANY,
which is also case sensitive.
When an XML document is constructed using this approach, it is essential to understand
how the document and its layout are to be constructed. It does not matter whether the
construction is partial or full, because the work involved is the same. For partial construction,
however, the parts that are to be constructed must be specified (Drakopoulos & Kanavos, 2016).
c. Edge Approach
In the third strategy, a single edge table is used to store the directed graph of the XML
files. Each node in the directed graph is assigned a unique id. Every tuple in the edge table
corresponds to exactly one edge of the directed graph. Each tuple contains the ids of the two
nodes linked by the edge, the tag of the target node, and an ordinal number that encodes the
order of the child nodes. The text is said to be inlined if the element has only a single text
child (Cornish, 2017).
An index is constructed on (tag, data) to reduce the execution time of selection queries.
The clustering strategy used for the edge table has a very significant effect on query
performance.
A drawback of this strategy is that elements with the same tag name cannot be
clustered together. Consequently, a query such as "select all scholars who are Americans
but were born in Turkey" will incur a large number of random inputs and outputs (I/Os).
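A minimal sketch of the edge table, using Python's built-in sqlite3 module, is given below. The column names (source, ordinal, tag, target, data), the sample document, and the use of id 0 for the document root are assumptions. Attributes are ignored, and text is inlined only for elements without child elements.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

def load_edge_table(conn, xml_text):
    """Shred an XML document into one row per parent-to-child edge."""
    conn.execute("CREATE TABLE edge (source INTEGER, ordinal INTEGER, "
                 "tag TEXT, target INTEGER, data TEXT)")
    conn.execute("CREATE INDEX edge_tag_data ON edge (tag, data)")
    ids = count(1)

    def shred(elem, parent_id, ordinal):
        node_id = next(ids)
        # Inline the text when the element has no child elements.
        data = elem.text.strip() if len(elem) == 0 and elem.text else None
        conn.execute("INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
                     (parent_id, ordinal, elem.tag, node_id, data))
        for position, child in enumerate(elem, start=1):
            shred(child, node_id, position)

    shred(ET.fromstring(xml_text), 0, 1)   # id 0 stands for the document root

sample = ("<scholars>"
          "<scholar><nationality>American</nationality>"
          "<birthplace>Turkey</birthplace></scholar>"
          "<scholar><nationality>Turkish</nationality>"
          "<birthplace>Turkey</birthplace></scholar>"
          "</scholars>")
conn = sqlite3.connect(":memory:")
load_edge_table(conn, sample)

# Scholars who are American and were born in Turkey: a self-join on the parent id.
rows = conn.execute(
    "SELECT n.source FROM edge AS n JOIN edge AS b ON n.source = b.source "
    "WHERE n.tag = 'nationality' AND n.data = 'American' "
    "AND b.tag = 'birthplace' AND b.data = 'Turkey'").fetchall()
```

Because every condition on a different tag needs another self-join of the same table, and rows with equal tags are not stored together, queries of this kind touch scattered rows, which is consistent with the I/O cost noted above.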
d. The Object Approach
A straightforward way of storing XML documents in an object manager is to store every
element as a separate object (Cornish, 2017). However, because XML documents tend to be
small, all the elements of an XML document are instead stored inside a single object, so
that most of the XML elements become lightweight entities within that object.
e. Native XML Storage Approach
A native XML database is a specialized database for storing and processing XML
documents. Its storage schemas aim to support efficient loading and storage of complete XML
documents, as well as efficient navigation within the documents. It uses a text-based mapping
for storing documents as flat files (Drakopoulos & Kanavos, 2016).
INDEXING TECHNIQUES

To summarize the structure of the data in the absence of a schema, and to support the
evaluation of path expressions, numerous structure indexes have been proposed for
semi-structured data, as follows:
1) Structure Indexes: a structure index I(G) is a summary graph that preserves
all the paths found in the data graph while having fewer nodes and edges
(Cornish, 2017). The main idea behind this index is that it relies on a numbering
scheme, which computes two numerical values for every element in the XML data tree:
one number is the element's pre-order rank and the other is its post-order rank.
These values are obtained from a depth-first traversal of the XML data tree.
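The numbering scheme can be sketched as a single depth-first traversal that assigns each element a (pre-order, post-order) pair. An element a is then an ancestor of d exactly when a has the smaller pre-order rank and the larger post-order rank. The function names and the sample tree are illustrative.

```python
import xml.etree.ElementTree as ET

def number_nodes(root):
    """Return {element: (pre, post)} computed in one depth-first traversal."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        pre = counters["pre"]          # rank on first entering the element
        counters["pre"] += 1
        for child in elem:
            visit(child)
        labels[elem] = (pre, counters["post"])   # post rank on leaving it
        counters["post"] += 1

    visit(root)
    return labels

def is_ancestor_prepost(labels, a, d):
    """a is an ancestor of d iff pre(a) < pre(d) and post(a) > post(d)."""
    return labels[a][0] < labels[d][0] and labels[a][1] > labels[d][1]

# Hypothetical tree: <a> has children <b> and <d>; <b> has child <c>.
root = ET.fromstring("<a><b><c/></b><d/></a>")
b, d = root[0], root[1]
c = b[0]
labels = number_nodes(root)
```

With these two numbers per element, an ancestor test is a pair of integer comparisons and needs no traversal of the tree.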
2) Connection Indexes.
A connection index is an index whose aim is to support the XPath axes that take
the place of wildcards along path expressions.
Labeling schemes for rooted trees that support ancestor queries are an active
area of research. The nodes of the XML tree are labeled in such a way that the
ancestor relationship can be decided by checking whether one node's label is a
prefix of the other's. Furthermore, such a scheme allows new nodes to be inserted
without affecting the existing labels in the XML documents (Drakopoulos &
Kanavos, 2016).
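A prefix labeling scheme can be sketched by giving each node its parent's label extended with the node's position among its siblings. The tuple representation and the function names are assumptions, and only appending a new last child is shown. Inserting between existing siblings requires a scheme that leaves room between labels, which is not covered here.

```python
import xml.etree.ElementTree as ET

def prefix_labels(root):
    """Label every element with its parent's label plus its sibling position."""
    labels = {}

    def visit(elem, label):
        labels[elem] = label
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels

def is_prefix_ancestor(a_label, d_label):
    """a is an ancestor of d iff a's label is a proper prefix of d's label."""
    return len(a_label) < len(d_label) and d_label[:len(a_label)] == a_label

# Hypothetical tree: <a> has children <b> and <d>; <b> has child <c>.
root = ET.fromstring("<a><b><c/></b><d/></a>")
labels = prefix_labels(root)
before = dict(labels)

# Append a new last child under the root; no existing label has to change.
new_node = ET.SubElement(root, "e")
labels[new_node] = labels[root] + (len(root),)
```

The new node receives the label (1, 3), and every label assigned earlier stays exactly as it was.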
