XML Information Retrieval System


Added on  2023/01/16

AI Summary
This paper reports on the development of an XML retrieval system and the models used to find relevant information from the XML database. It discusses the process of capturing, storing, organizing, and retrieving information in XML, as well as the different IR models such as the Boolean model, vector model, and probabilistic model. The paper also explores the steps involved in the retrieval process, including stopword removal, stemming, indexing, ranking, and query evaluation. Additionally, it discusses different XML storage techniques such as the text approach, relational DTD approach, edge approach, object approach, and native XML storage approach.

Running head: INFORMATION RETRIEVAL IN XML BASED DATABASE 1
XML INFORMATION RETRIEVAL SYSTEM
Student’s name
Institutional
Date

Abstract
The purpose of this paper is to report on the development of an Extensible Markup Language
(XML) retrieval system, a development that has kept pace with the extensive growth of XML
information repositories. XML has emerged as the dominant standard for representing data and
for exchanging it over the global network.
Introduction
Information retrieval in XML is concerned with all the processes used to capture data,
present the captured data, and store, organize and retrieve the information.
A database is a collection of correlated data kept in an organized manner for the
purpose of easy retrieval (Drakopoulos & Kanavos, 2016). The entire process of data retrieval is
made up of several stages, which start at the data representation stage and run to the display
of results to the user. It includes intermediate steps such as searching/matching, ranking
and filtering. For a search operation to be considered
successful, it must return results that are mostly relevant (Jukic, Vrbsky, &
Nestorov, 2016).
As a markup language, XML defines the standards and rules used to encode documents
so that they are both human-readable and machine-readable (Lu,
Liu, & Wu, 2015). Its adoption has created the need to manage large collections of XML
documents and to retrieve the relevant information from them efficiently.
The development of both XML and web services has been of great help in that it
provides interoperability between distributed technologies (Ma, Bao, Bao, Yuan,
Huang, & Zhao, 2017). However, it has a drawback: if a component is tightly bound, it cannot
be used separately with a different component. To overcome this problem, an XML-based model is
used to reduce the maintenance overhead (Agarwal & Ramamritham, 2015).
This paper describes the information retrieval models. It explains
the retrieval system as well as the models used to find relevant information in
the XML database.
INFORMATION RETRIEVAL (IR) MODELS
In the information retrieval context, XML is emphasized mainly as a tool for encoding data
in the form of text or documents (Monisha & Vigneshwari, 2015). An arguably more extensive use of
XML is for encoding non-text data. For instance, a user may need to export data
in XML format from a resource planning system and read it in order
to produce analytical graphs for a presentation (Munir & Anjum, 2018). This
category of XML application is termed data-centric, because numeric and other non-text data
dominate while text constitutes a very small portion of the total data. Data-centric XML is
stored in a database (Jukic, Vrbsky, & Nestorov, 2016).
Here, an information retrieval (IR) system is developed, either fully or
partially, for XML documents or files. The XML documents are preprocessed, and
structural terms are generated for every document (Yalamanchi & Perry,
2017). In response to the user query, a ranking method is applied so that the documents are
ranked accordingly. The ranked results are then presented as output.
Furthermore, to improve searching and navigation in
documents, stopword removal is performed to avoid unnecessary memory allocation.
An IR model specifies how documents are represented, how queries are represented, and
how the retrieval operation is carried out. The fundamental classes of IR model include:
Boolean model
Vector model
Probabilistic model

a) Boolean Model
In this model, a set of keywords is associated with each document, and a query is a Boolean
combination of keywords. The retrieval operation of the Boolean model classifies each
document as either relevant or irrelevant to the query (Munir & Anjum, 2018). Because matching
documents are not ranked and the web is highly redundant, it is very difficult to assign
priorities to the documents to be displayed when this
model is implemented (Agarwal & Ramamritham, 2015).
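The Boolean model can be illustrated with a minimal sketch. The collection `DOCS`, the helper names `build_index` and `term`, and the sample queries are hypothetical and are not taken from the system described in this paper; the sketch only shows how Boolean operators map onto set operations over a keyword index.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "xml retrieval system",
    2: "relational database system",
    3: "xml database storage",
}

def build_index(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Return the ids of the documents containing the word (empty set if none)."""
    return INDEX.get(word, set())

# Boolean operators map directly onto set operations.
xml_and_database = term("xml") & term("database")      # AND
xml_or_relational = term("xml") | term("relational")   # OR
system_not_xml = term("system") - term("xml")          # AND NOT
```

Every document in a result set is simply "relevant"; nothing in the sets says which match should be displayed first, which is the prioritization difficulty noted above.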
b) Probabilistic Model
This class of IR model assumes that
there exists a set of documents that constitutes the ideal response to the
user's query (Scheevel, Ozor, Hilton, & Collins, 2016). The user inspects
a preliminary document set through an
interactive interface, and the instant feedback is used to approach the
ideal answer set.
The model is based on the probability ranking principle.
This principle states that the IR system should rank documents by their probability of
relevance to the user's query (Jukic, Vrbsky, & Nestorov, 2016).
c) Vector Space Model
The vector space model is the best-known IR model. The degree of similarity
between each document vector and the query vector is measured and used for ranking.
XML retrieval must also take into account the structural context of
the terms (Yalamanchi & Perry, 2017). Therefore, the vector space model is used to
represent such structural context.
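As a rough illustration of ranking by vector similarity, the following sketch weights terms by tf * idf and compares vectors with the cosine measure. The weighting variant, the function names and the toy collection are assumptions made for this example; the paper does not state which weights its system uses. Plain words are used as terms here, whereas an XML system could use structural terms (a path combined with a word) in the same way.

```python
import math
from collections import Counter

# Hypothetical toy collection, used only for illustration.
DOCS = [
    "xml retrieval system",
    "relational database system",
    "xml database storage",
]

def tokenize(text):
    return text.lower().split()

def build_model(docs):
    """Return (document vectors, idf table) using tf * idf weights."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Assumption: idf = ln(N / df) + 1, one of several common variants.
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document positions ordered by decreasing similarity to the query."""
    vectors, idf = build_model(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0.0]

ranking = rank("xml retrieval", DOCS)
```

For the query "xml retrieval", the first document (which contains both words) is ranked ahead of the third (which contains only "xml"), and the second is not returned at all.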
XML documents comprise both the data and the structure that binds it. This is done in
such a way that the data is readable by both machines and people. An XML document can be
transferred from one party to another by electronic means, and it carries all the
information with it; thus it is self-describing (Ling, Zeng, Le, & Lee, 2016). Because of
this feature, XML has gained much value in Service-Oriented Architectures (SOA) and web
services (Agarwal & Ramamritham, 2015).
This work is subdivided into the following steps:
Step 1: stopword removal
Step 2: stemming
Step 3: indexing
Step 4: ranking
Step 5: query evaluation
i. Stop words
Stop words are a kind of noise that interferes with quickly
ascertaining the meaning, relevance and importance of the words found in a given
document (Jukic, Vrbsky, & Nestorov, 2016). For the message to be clear, such words should be
filtered out. Filtering stop words reduces the size of the index and helps
search queries return better results (Celesti, Fazio, Romano, & Villari,
2016).
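A minimal sketch of stop-word filtering follows. The stop-word list is hypothetical and deliberately tiny; a real system would use a much longer, language-specific list.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "for"}

def remove_stop_words(text):
    """Lower-case the text, split it on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = remove_stop_words("The storage of an XML document in a database")
# tokens == ["storage", "xml", "document", "database"]
```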
ii. Stemming
Stemming is the practice of reducing inflected (or occasionally derived) words to
their stem, root or base form. Two well-known stemming algorithms are
the Lovins algorithm and Porter's algorithm.
The Lovins algorithm defines a large list of suffix patterns and adopts a cyclic heuristic
methodology. Its design was strongly influenced by the technical
vocabulary that Lovins came across in her work (Monisha & Vigneshwari, 2015).
The Lovins algorithm is noticeably larger than the well-known Porter algorithm because
of its very extensive list of endings, but this brings an advantage: it is
faster. In effect, the algorithm trades space for time by relying on its larger set of
suffixes (Celesti, Fazio, Romano, & Villari, 2016).
Porter's algorithm, on the other hand, is simpler than the Lovins algorithm
because it has fewer rules: only 60 rules, organized into sets. Conflicts
between the rules within a set are resolved before the next set is applied. The rule sets
are separated into distinct phases, and each of the 5 phases removes a particular
kind of word suffix.
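The general idea of suffix stripping can be sketched as follows. This is a toy stemmer written for illustration only: it is neither the Lovins nor the Porter algorithm, and the suffix list and the minimum stem length are assumptions.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ["ations", "ation", "ing", "ed", "es", "s"]
MIN_STEM = 3  # assumption: never leave a stem shorter than 3 letters

def toy_stem(word):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["indexing", "ranked", "documents", "is"]]
# stems == ["index", "rank", "document", "is"]
```

Real stemmers add conditions on the remaining stem and recoding rules, which this sketch omits.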
iii. Indexing
This step involves estimating the distribution of all the terms, both tags and
literals, inside the document, that is, their frequencies. Any duplicate path or term is
removed so that the resulting XML summary tree contains each path at most once.
Consequently, the summary tree is smaller than the original document and
contains only the tags and terms that matter for indexing (Celesti,
Fazio, Romano, & Villari, 2016). The word distribution can additionally be used to generate
a weight for every term of the summary tree to aid the ranking process.
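The construction of such a summary can be sketched with the standard library XML parser. The sample document and the function name `summarize` are hypothetical; the sketch collects each distinct tag path once and counts word frequencies, which could later serve as term weights.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document.
XML_TEXT = """<library>
  <book><title>XML retrieval</title><author>Smith</author></book>
  <book><title>XML storage</title><author>Jones</author></book>
</library>"""

def summarize(xml_text):
    """Return (distinct tag paths, word frequencies) for one document."""
    root = ET.fromstring(xml_text)
    paths = set()      # a set keeps each path at most once
    terms = Counter()  # word -> frequency in the document

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        # Only the text directly inside the element is counted here;
        # tail text after child elements is ignored in this sketch.
        if node.text and node.text.strip():
            terms.update(node.text.lower().split())
        for child in node:
            walk(child, path)

    walk(root, "")
    return sorted(paths), terms

paths, terms = summarize(XML_TEXT)
```

Although the document contains two `book` elements, the summary holds the path `/library/book` only once.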
iv. Ranking
This step is concerned with loading the summary tree into the index structure. This
includes separating the content data (raw text, which is stored in an inverted
file) from the path data (structured text, which is stored in the path index). A path index
is a hierarchy of tags that keeps a record of each path in the given collection (Lu, Liu, & Wu, 2015).
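The separation of content data from path data can be sketched as two dictionaries. The layout chosen here (word to document and path pairs, and path to document ids) and all names are assumptions for illustration, not the index format of the system described.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_id: xml_text}. Return (inverted_file, path_index)."""
    inverted_file = defaultdict(set)  # word -> {(doc_id, path)}
    path_index = defaultdict(set)     # path -> {doc_id}

    def walk(doc_id, node, prefix):
        path = prefix + "/" + node.tag
        path_index[path].add(doc_id)
        for word in (node.text or "").lower().split():
            inverted_file[word].add((doc_id, path))
        for child in node:
            walk(doc_id, child, path)

    for doc_id, text in docs.items():
        walk(doc_id, ET.fromstring(text), "")
    return inverted_file, path_index

# Hypothetical two-document collection.
DOCS = {
    "d1": "<book><title>XML retrieval</title></book>",
    "d2": "<article><title>XML storage</title></article>",
}
inverted_file, path_index = build_indexes(DOCS)
```

A query for the word "xml" restricted to the path `/book/title` can then intersect the two structures and return only `d1`.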
v. Query Evaluation
After preprocessing, the structural terms of the query are generated. They are then
matched against the structural terms generated from the XML documents
in order to evaluate context resemblance (Cornish, 2017).
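The paper does not state which resemblance measure is used. One definition found in the XML retrieval literature scores a query path cq against a document path cd as (1 + |cq|) / (1 + |cd|) when cq can be turned into cd by inserting extra nodes, and as 0 otherwise. The sketch below implements that definition as an assumption; the function names are hypothetical.

```python
def is_subsequence(query_path, doc_path):
    """True if query_path can be turned into doc_path by inserting extra nodes."""
    remaining = iter(doc_path)
    # "tag in remaining" advances the iterator, so order is respected.
    return all(tag in remaining for tag in query_path)

def context_resemblance(query_path, doc_path):
    """Return (1 + |cq|) / (1 + |cd|) if the paths match, else 0.0."""
    if not is_subsequence(query_path, doc_path):
        return 0.0
    return (1 + len(query_path)) / (1 + len(doc_path))

exact = context_resemblance(["book", "title"], ["book", "title"])
partial = context_resemblance(["book", "title"],
                              ["library", "book", "chapter", "title"])
mismatch = context_resemblance(["title", "book"], ["book", "title"])
```

An exact match scores 1.0, a match that needs two inserted nodes scores 0.6, and a path whose tags appear in the wrong order scores 0.0.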
XML STORAGE TECHNIQUES
Depending on what it contains, an XML document may be modeled as a
tree or as a graph. It is modeled as a tree when the document
has neither internal nor global links; when the document has
internal or global links, it is modeled as a graph (Lu, Liu, & Wu, 2015).
In a tree structure, the nodes represent XML elements and attributes, and the edges
represent parent-child relationships. Boxes with rounded corners denote attribute
and/or text nodes. The XML storage techniques include the following: Text Approach,
Relational DTD Approach, Edge Approach, Object Approach and Native XML Storage
Approach (Cornish, 2017).
a. Text Approach
The first approach stores every XML document as a text file. One
way of implementing a query engine with the text approach is to parse the XML file
into a tree that resides in memory, against which the query is then executed.
The tree is retained in memory as long as some of its nodes are
required to evaluate the query.
To make this approach competitive, the following indexing scheme
has been adopted: the offsets of the XML elements within the file are recorded, each parent
tag offset is mapped to the offsets of its children, and a path index is built on this mapping.
The main disadvantage of this approach is that every time an
XML document is updated, the offsets of the affected elements
also change. This invalidates the indices, which
must then be rebuilt (Cornish, 2017).
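The text approach and its update problem can be sketched as follows. The sample document and the query are hypothetical; the sketch parses the text into an in-memory tree, runs a query against it, and then shows how a small update shifts the byte offset of a later element, which is what invalidates offset-based indices.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document stored as plain text.
XML_TEXT = """<scholars>
  <scholar><name>Ada</name><country>UK</country></scholar>
  <scholar><name>Omar</name><country>Turkey</country></scholar>
</scholars>"""

# Parse the text into an in-memory tree and execute a query against it.
root = ET.fromstring(XML_TEXT)
names = [
    scholar.findtext("name")
    for scholar in root.findall("scholar")
    if scholar.findtext("country") == "Turkey"
]

# Offset of the second <scholar> element before and after an update.
marker = "<scholar><name>Omar"
offset_before = XML_TEXT.index(marker)
updated_text = XML_TEXT.replace("Ada", "Ada Lovelace")
offset_after = updated_text.index(marker)
shift = offset_after - offset_before  # 9 characters were inserted earlier in the file
```

Any index that stored `offset_before` for the second element would now point at the wrong position and would have to be rebuilt.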
Concurrency control requires locking the XML
document and the corresponding indices while threads are accessing the data.
For instance, while one thread is reading, the others can read as well;
however, while a thread is performing an update, no other thread can
read or update. It is very challenging when new threads
keep gaining access to read the document but are unable to
update any section of it (Drakopoulos & Kanavos, 2016).
b. The Relational DTD Approach
The second strategy is known as the shared-inlining method, which requires
the existence of a Document Type Definition (DTD). Element declarations in a DTD
begin with the case-sensitive <!ELEMENT keyword, followed by the element's name and then
by the content specification. The content specification may be the
keyword ANY, which is also case-sensitive.
When constructing an XML document with this approach, it is essential to
understand how to construct the document and its layout. It does not matter whether the
construction is partial or full, because the work to be done is the same. However, for
partial construction, the parts that are to be
constructed must be specified (Drakopoulos & Kanavos, 2016).
c. Edge Approach
In the third strategy, a single edge table is used to store the directed graph of the XML
files. In the directed graph, a unique id is assigned to each node.
Every tuple in the edge table corresponds to exactly one edge of the
directed graph. Each tuple comprises the ids of the 2 nodes linked by the
edge, the tag of the target node, and an ordinal number that encodes the
order of the children. The text is said to be inlined if the element has only a single text
child (Cornish, 2017).

An index is constructed on (tag, data) to reduce the execution time
of selection queries. The clustering strategy used for the edge table has a
very significant effect on query performance.
This strategy has a drawback in that elements with the same tag name
cannot be clustered together. Consequently, a query such as "select all scholars who are
Americans but were born in Turkey" incurs a large number of random inputs and outputs (I/Os).
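A minimal sketch of an edge table, using an in-memory SQLite database, is given below. The column names, the virtual root id 0 and the sample document are assumptions made for this example; only the idea of one tuple per edge, the inlined text and the (tag, data) index come from the description above.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical sample document.
XML_TEXT = ("<scholars><scholar><name>Omar</name>"
            "<born>Turkey</born></scholar></scholars>")

def load_edges(xml_text):
    """Store one tuple per edge: (source, ordinal, tag, target, data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (source INTEGER, ordinal INTEGER, "
        "tag TEXT, target INTEGER, data TEXT)"
    )
    conn.execute("CREATE INDEX edge_tag_data ON edge (tag, data)")
    ids = count(1)  # unique id for each node

    def walk(node, parent_id, ordinal):
        node_id = next(ids)
        # Inline the element's own text, if any, in the data column.
        text = (node.text or "").strip() or None
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (parent_id, ordinal, node.tag, node_id, text),
        )
        for position, child in enumerate(node, start=1):
            walk(child, node_id, position)

    walk(ET.fromstring(xml_text), 0, 1)  # 0 stands for a virtual root
    return conn

conn = load_edges(XML_TEXT)
rows = conn.execute(
    "SELECT source, ordinal, target FROM edge WHERE tag = ? AND data = ?",
    ("born", "Turkey"),
).fetchall()
edge_count = conn.execute("SELECT COUNT(*) FROM edge").fetchone()[0]
```

The selection on `born = Turkey` can use the (tag, data) index, but finding the matching scholar's other properties requires further lookups on `source`, which is where the random I/Os mentioned above come from.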
d. The Object Approach
A straightforward way of storing XML documents
within an object manager is to store every element as a separate
object (Cornish, 2017). However, because XML elements are usually small,
it is preferable to store all the elements of an XML document
inside a single object, so that most XML elements become
lightweight entities within that object.
e. Native XML Storage Approach
A native XML database is a specialized database for storing and processing XML
documents. Its storage schemas aim to support efficiently the loading and
storage of complete XML documents, as well as efficient navigation within the
documents. It uses a text-based mapping for storing documents as flat files (Drakopoulos &
Kanavos, 2016).
INDEXING TECHNIQUES
To summarize the structure of the data in the absence of a schema and
to support the evaluation of path expressions, numerous structure indexes have been proposed
for semi-structured data. They are described as follows:
1) Structure Indexes: a structure index I(G) is a summary graph that
preserves all the paths found in the data graph but
has fewer nodes and edges (Cornish, 2017). The major idea
behind this index is that it depends on a numbering scheme, which computes 2
numerical values for every element found in the XML data tree.
One number represents the pre-order rank and the other represents the post-order rank.
These values are produced by a depth-first search of the XML data
tree.
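The numbering scheme can be sketched as a single depth-first traversal. The function names and the sample tree are hypothetical. The sketch also shows the standard use of the two numbers: a node is an ancestor of another exactly when its pre-order number is smaller and its post-order number is larger.

```python
import xml.etree.ElementTree as ET

def number_nodes(root):
    """Return {element: (pre, post)} computed in one depth-first traversal."""
    numbers = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre = counters["pre"]          # assigned when the node is entered
        for child in node:
            visit(child)
        counters["post"] += 1          # assigned when the node is left
        numbers[node] = (pre, counters["post"])

    visit(root)
    return numbers

def is_ancestor(numbers, a, d):
    """True if a is a proper ancestor of d."""
    return numbers[a][0] < numbers[d][0] and numbers[a][1] > numbers[d][1]

# Hypothetical sample tree: a has children b and d; b has child c.
root = ET.fromstring("<a><b><c/></b><d/></a>")
numbers = number_nodes(root)
b, d = root.find("b"), root.find("d")
c = b.find("c")
```

Here `a` receives (1, 4), `b` (2, 2), `c` (3, 1) and `d` (4, 3), so `is_ancestor(numbers, root, c)` holds while `is_ancestor(numbers, b, d)` does not.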
2) Connection Indexes
A connection index is an index that supports the XPath axes that are used
as wildcards in path expressions.
Labeling schemes for rooted trees that support ancestor queries
have been developed in various studies. The nodes of the XML tree are labeled
in such a way that the ancestor relationship can be determined by checking whether one
label is a prefix of the other. Furthermore, such schemes allow new nodes to be inserted
without affecting the existing labels of the existing XML documents (Drakopoulos &
Kanavos, 2016).
These schemes assign binary strings to the
edges of a tree in such a way that the set of strings assigned to the edges
leaving any node is prefix-free (a prefix-free assignment).
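One simple prefix-free assignment, chosen here purely for illustration, gives the i-th child edge the code '0', '10', '110' and so on; a node's label is the concatenation of the edge codes on the path from the root. The tree representation, the node names (assumed unique) and the function names are hypothetical.

```python
def child_code(i):
    """Edge code for the i-th child (1-based): '0', '10', '110', ... (prefix-free)."""
    return "1" * (i - 1) + "0"

def label_tree(tree, label=""):
    """tree: (name, [children]). Return {name: label}; names are assumed unique."""
    name, children = tree
    labels = {name: label}
    for i, child in enumerate(children, start=1):
        labels.update(label_tree(child, label + child_code(i)))
    return labels

def is_ancestor(labels, a, d):
    """a is a proper ancestor of d if a's label is a proper prefix of d's label."""
    return labels[a] != labels[d] and labels[d].startswith(labels[a])

# Hypothetical tree: a has children b and d; b has child c.
TREE = ("a", [("b", [("c", [])]), ("d", [])])
labels = label_tree(TREE)

# Appending a new last child e under a leaves the existing labels untouched.
BIGGER = ("a", [("b", [("c", [])]), ("d", []), ("e", [])])
bigger_labels = label_tree(BIGGER)
```

With this assignment `a` is labeled with the empty string, `b` with "0", `c` with "00" and `d` with "10"; the new node `e` receives "110" and no existing label changes.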
3) Path Indexes
A path index is an index that supports the navigational
XPath axes (parent, child, descendant-or-self, ancestor-or-self, descendant
and ancestor).
Path indexes vary greatly in their space usage and in how they support paths with
wildcards (that is, arbitrarily long paths from a source node to
target nodes in the XML graph) (Lu, Liu, & Wu, 2015). They are largely
based on structure summaries of the given XML graph. A structure summary is a very
important technique for indexing an arbitrary XML graph when
no general schema of the information is available (Drakopoulos & Kanavos, 2016).
Summary
The main advantage of an information retrieval (IR) system is that it returns numerous
documents for the information being sought on the global network. Thus, estimating
relevance is one of the key factors in ensuring that the documents most likely to
answer the query are pushed up in the final ranking. XML IR systems apply a simple
but fast Boolean model to the term expression to discover the matching documents
(Cornish, 2017). They rely on both a qualitative ranking algorithm and the user's ability
to rephrase a query in order to give the most valuable service to users and
satisfy their information need (Celesti, Fazio, Romano, & Villari, 2016).

Some of the challenges facing IR can be summarized as extended queries,
passage retrieval, schema heterogeneity and focused retrieval. To improve the
performance of IR, complementary joint clustering algorithms can help.
Conclusion
XML is advantageous in web design. Numerous information
retrieval (IR) systems have been analyzed in detail. XML does a very good job of providing
self-describing data, and hence it has become a key standard for
Service-Oriented Architectures (SOA) and web services. It also provides a sophisticated
solution. An appropriate information retrieval (IR) model aims at retrieving the relevant
information (Agarwal & Ramamritham, 2015).
Tag transformation can be of much use for handling variant words by means of a
sound-alike technique. To further promote regularity, prefixes and suffixes are removed
from the tags.
Moreover, to further increase the accuracy of XML information retrieval, methods for
handling proximity tests between tags should be devised (Ayzenshtadt et al., 2016).
References
Amjad, T., Ding, Y., Daud, A., Xu, J., & Malic, V. (2015). Topic-based heterogeneous rank.
Scientometrics, 104(1), 313-334.
Ayzenshtadt, V., Langenhan, C., Roith, J., Bukhari, S., Althoff, K. D., Petzold, F., & Dengel, A.
(2016, October). Comparative evaluation of rule-based and case-based retrieval
coordination for search of architectural building designs. In International Conference on
Case-Based Reasoning (pp. 16-31). Springer, Cham.
Bousalem, Z., & Cherti, I. (2015). XMap: A Novel Approach to Store and Retrieve XML
Document in Relational Databases. JSW, 10(12), 1389-1401.
Büttcher, S., Clarke, C. L., & Cormack, G. V. (2016). Information retrieval: Implementing and
evaluating search engines. MIT Press.
Cornish, A. (2017). Using a native XML database for Encoded Archival Description search and
retrieval. Information Technology and Libraries, 23(4), 181-184.
Das, M., Cheng, J. C., & Kumar, S. S. (2015). Social BIMCloud: a distributed cloud-based BIM
platform for object-based lifecycle information exchange. Visualization in Engineering,
3(1), 8.
Hwang, K. H., Lee, H., Koh, G., Willrett, D., & Rubin, D. L. (2017). Building and querying
RDF/OWL database of semantically annotated nuclear medicine images. Journal of Digital
Imaging, 30(1), 4-10.
Jeong, S., Hou, R., Lynch, J. P., Sohn, H., & Law, K. H. (2017). An information modeling
framework for bridge monitoring. Advances in engineering software, 114, 11-31.
Jukic, N., Vrbsky, S., & Nestorov, S. (2016). Database systems: Introduction to databases and
data warehouses. Prospect Press.
Kim, S., Thiessen, P. A., Bolton, E. E., & Bryant, S. H. (2015). PUG-SOAP and PUG-REST:
web services for programmatic access to chemical information in PubChem. Nucleic
acids research, 43(W1), W605-W611.
Lu, C., Liu, M., & Wu, Z. (2015). Svql: A sql extended query language for video databases.
International Journal of Database Theory and Application, 8(3), 235-248.
Ma, L., Bao, W., Bao, W., Yuan, W., Huang, T., & Zhao, X. (2017, January). A Mongolian
information retrieval system based on Solr. In 2017 9th International Conference on
Measuring Technology and Mechatronics Automation (ICMTMA) (pp. 335-338). IEEE.
Marketakis, Y., Minadakis, N., Kondylakis, H., Konsolaki, K., Samaritakis, G., Theodoridou,
M., ... & Doerr, M. (2017). X3ML mapping framework for information integration in
cultural heritage and beyond. International Journal on Digital Libraries, 18(4), 301-319.
Monisha, S., & Vigneshwari, S. (2015). A FRAMEWORK FOR ONTOLOGY BASED LINK
ANALYSIS FOR WEB MINING. Journal of Theoretical & Applied Information
Technology, 73(2).

Munir, K., & Anjum, M. S. (2018). The use of ontologies for effective knowledge modelling and
information retrieval. Applied Computing and Informatics, 14(2), 116-126.
Ougouti, N. S., Belbachir, H., & Amghar, Y. (2015). A new owl2 based approach for relational
database description. International Journal of Information Technology and Computer
Science, 7(1), 48-53.
Piernik, M., Brzezinski, D., & Morzy, T. (2016). Clustering XML documents by patterns.
Knowledge and Information Systems, 46(1), 185-212.
Sabri, Q. U., Bayer, J., Ayzenshtadt, V., Bukhari, S. S., Althoff, K. D., & Dengel, A. (2017,
February). Semantic Pattern-based Retrieval of Architectural Floor Plans with Case-
based and Graph-based Searching Techniques and their Evaluation and Visualization. In
ICPRAM (pp. 50-60).
Scheevel, M. R., Ozor, T. A., Hilton, G. S., & Collins, J. M. (2016). U.S. Patent No. 9,305,033.
Washington, DC: U.S. Patent and Trademark Office.
Stanković, R., Krstev, C., Vitas, D., Vulović, N., & Kitanović, O. (2016, September).
Keyword-based search on bilingual digital libraries. In Semantic Keyword-based Search on
Structured Data Sources (pp. 112-123). Springer, Cham.
Yalamanchi, A., & Perry, M. (2017). U.S. Patent No. 9,805,076. Washington, DC: U.S. Patent
and Trademark Office.
Agarwal, M. K., & Ramamritham, K. (2015, April). Enabling generic keyword search over raw
XML data. In 2015 IEEE 31st International Conference on Data Engineering (pp. 1496-1499).
IEEE.
Sikos, L. F., & Powers, D. M. (2015, October). Knowledge-driven video information retrieval
with LOD: from semi-structured to structured video metadata. In Proceedings of the
Eighth Workshop on Exploiting Semantic Annotations in Information Retrieval (pp. 35-
37). ACM.
Celesti, A., Fazio, M., Romano, A., & Villari, M. (2016, May). A hospital cloud-based archival
information system for the efficient management of HL7 big data. In 2016 39th
International Convention on Information and Communication Technology, Electronics
and Microelectronics (MIPRO) (pp. 406-411). IEEE.
Ling, T. W., Zeng, Z., Le, T. N., & Lee, M. L. (2016, May). ORA-semantics based keyword
search in XML and relational databases. In 2016 IEEE 32nd International Conference on
Data Engineering Workshops (ICDEW) (pp. 157-160). IEEE.
Drakopoulos, G., & Kanavos, A. (2016, July). Tensor-based document retrieval over Neo4j with
an application to PubMed mining. In 2016 7th International Conference on Information,
Intelligence, Systems & Applications (IISA) (pp. 1-6). IEEE.
Saha, S., Parbat, T., & Neogy, S. (2017, January). Designing a secure data retrieval strategy
using NoSQL database. In International Conference on Distributed Computing and
Internet Technology (pp. 235-238). Springer, Cham.