SIT772 Database and Information Retrieval: Inverted Index Report

Added on  2023/06/12

This report details the creation of an inverted index from a set of documents, including stop word removal, Porter stemming, and merging the documents. It covers the generation of a dictionary and posting file, followed by testing using boolean and vector queries. The report also includes an information retrieval evaluation comparing Google and Bing search engines based on precision and recall for a specific search target, providing a comparative analysis of their performance.
Contents
Question 1
Creating an inverted index
A Remove stop words
Porter Stemming algorithm
B The three documents merged
Sorting words in alphabetical order for all documents
Word frequency for every document
C Dictionary and related posting file
D Testing
Boolean and vector queries
Question 2 IR evaluation
A Search engines
B Targets
Search queries
Google search Engine
Bing Search Engine
C Average for Google and Bing
Bibliography
Question 1
Creating an inverted index
Document 1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
Document 2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
Document 3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
A Remove stop words
Results
Document 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
Document 2
Information retrieval finding material unstructured nature satisfies information within large
collections
Document 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
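The stop word removal step can be sketched in Python as follows. This is only a minimal illustration: the report does not say which stop list was used, so `STOP_WORDS` is an assumed list chosen so that the sketch reproduces the results shown above, and `remove_stop_words` is a hypothetical helper name.

```python
import string

# Assumed stop list (not given in the report); chosen only so that this
# sketch reproduces the stop-word-free documents shown above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need",
    "of", "on", "or", "other", "that", "the", "to", "use",
}


def remove_stop_words(text):
    """Return the words of `text` that are not stop words, in original order."""
    kept = []
    for raw in text.split():
        # Drop leading/trailing punctuation such as "," and "." but keep
        # inner hyphens, so "full-text" stays a single token.
        token = raw.strip(string.punctuation)
        if token and token.lower() not in STOP_WORDS:
            kept.append(token)
    return kept


doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")
print(" ".join(remove_stop_words(doc2)))
```

A different stop list (for example one that keeps "need" or "use") would give slightly different output, so the list should be stated alongside the results.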
Porter Stemming algorithm
Results
Document 1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
Document 2
Information retrieve find material unstructure nature satisfy information within large collect
Document 3
Information system study complement network hardware software people organ collect filter
process create distribute data
B The three documents merged
Term Document
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Collect 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Sorting words in alphabetical order for all documents
Term Doc ID
Active 1
Base 1
Base 1
Collect 1
Collect 2
Collect 3
Complement 3
Content 1
Create 3
Data 3
Distribute 3
Filter 3
Find 2
Full 1
Hardware 3
Index 1
Inform 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
Organ 3
People 3
Process 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Within 2
Word frequency for every document
Term Doc ID Frequency
Active 1 1
Base 1 2
Collect 1 1
Collect 2 1
Collect 3 1
Complement 3 1
Content 1 1
Create 3 1
Data 3 1
Distribute 3 1
Filter 3 1
Find 2 1
Full 1 1
Hardware 3 1
Index 1 1
Inform 1 1
Information 1 3
Information 2 2
Information 3 1
Large 2 1
Material 2 1
Nature 2 1
Network 3 1
Obtain 1 1
Organ 3 1
People 3 1
Process 3 1
Relevant 1 1
Resource 1 2
Retrieve 1 1
Retrieve 2 1
Satisfy 2 1
Search 1 1
Software 3 1
Study 3 1
System 3 1
Text 1 1
Unstructure 2 1
Within 2 1
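The merge, sort and count steps can be sketched in Python as follows. This is an illustration rather than the method actually used for the report: the stemmed documents are copied from the Porter stemming results and lower-cased, and `term_doc_pairs` and `term_frequencies` are hypothetical helper names.

```python
from collections import Counter

# Stemmed documents as given in the Porter stemming results, lower-cased.
STEMMED_DOCS = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy information "
       "within large collect",
    3: "information system study complement network hardware software people "
       "organ collect filter process create distribute data",
}


def term_doc_pairs(docs):
    """Merge step: one (term, doc_id) pair per token occurrence."""
    return [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]


def term_frequencies(docs):
    """Sort and count steps: (term, doc_id, frequency) rows in alphabetical order."""
    counts = Counter(term_doc_pairs(docs))
    return sorted((term, doc_id, freq) for (term, doc_id), freq in counts.items())


for term, doc_id, freq in term_frequencies(STEMMED_DOCS):
    print(term, doc_id, freq)
```

Counting pairs with `Counter` collapses repeated (term, document) rows of the sorted list into a single row with a frequency, which is the same step the table performs by hand.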
C Dictionary and related posting file
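One common way to lay out a dictionary and its posting file is sketched below in Python. The layout is an assumption, not taken from the report: each dictionary entry holds the number of documents containing the term, the total frequency, and an offset into a flat posting list of (doc ID, frequency) pairs. The function names `build_index` and `lookup` are hypothetical, and only a few rows of the frequency table are used as input.

```python
def build_index(rows):
    """Build a dictionary and posting file.

    `rows` must be (term, doc_id, frequency) tuples grouped by term, so that
    each term's postings end up contiguous in the posting file.
    """
    dictionary = {}  # term -> [number of docs, total frequency, posting offset]
    postings = []    # flat posting file of (doc_id, frequency) pairs
    for term, doc_id, freq in rows:
        if term not in dictionary:
            dictionary[term] = [0, 0, len(postings)]
        entry = dictionary[term]
        entry[0] += 1
        entry[1] += freq
        postings.append((doc_id, freq))
    return dictionary, postings


def lookup(term, dictionary, postings):
    """Return the (doc_id, frequency) postings for `term`, or [] if absent."""
    if term not in dictionary:
        return []
    n_docs, _total, offset = dictionary[term]
    return postings[offset:offset + n_docs]


# A subset of the frequency table, already sorted by term and doc ID.
ROWS = [
    ("index", 1, 1),
    ("information", 1, 3), ("information", 2, 2), ("information", 3, 1),
    ("retrieve", 1, 1), ("retrieve", 2, 1),
    ("system", 3, 1),
]

dictionary, postings = build_index(ROWS)
print(dictionary["information"])                    # [3, 6, 1]
print(lookup("information", dictionary, postings))  # [(1, 3), (2, 2), (3, 1)]
```

Keeping the postings in a separate flat list mirrors the idea of a small in-memory dictionary that points into a larger posting file.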
D Testing
The inverted index was tested using the Google search engine, and the results returned by the search
engine included all three documents. The keywords used for the search were information, system and
index.
Boolean and vector queries
e. Boolean queries
1) information AND system AND index = no documents (no single document contains all three terms)
2) system AND index = no documents (system occurs only in Doc3, index only in Doc1)
3) information AND index = Doc1
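Boolean queries over an inverted index reduce to set operations on posting lists: AND is an intersection and OR is a union. The Python sketch below illustrates this under the assumption that the postings are those in the frequency table (information in documents 1 to 3, system in document 3, index in document 1); `boolean_and` and `boolean_or` are hypothetical helper names.

```python
# Posting sets for the three query terms, taken from the frequency table.
POSTINGS = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}


def boolean_and(*terms):
    """Documents that contain every term (intersection of posting sets)."""
    sets = [POSTINGS.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(*terms):
    """Documents that contain at least one term (union of posting sets)."""
    result = set()
    for term in terms:
        result |= POSTINGS.get(term, set())
    return result


print(boolean_and("information", "system", "index"))  # set()
print(boolean_and("system", "index"))                 # set()
print(boolean_and("information", "index"))            # {1}
print(boolean_or("information", "system", "index"))   # {1, 2, 3}
```

Result sets that contain all three documents for these terms correspond to OR semantics rather than AND.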
f. Vector queries
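The cosine similarity between each document and the query is computed from the term frequencies, with vector components in the order <information, system, index>.
Query Q = (information, system, index)
Document 1
D1 = <information, information, information, index> = <3, 0, 1>
Q = <information, system, index> = <1, 1, 1>
$$\sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73$$
Document 2
D2 = <information, information> = <2, 0, 0>
Q = <information, system, index> = <1, 1, 1>
$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$
Document 3
D3 = <information, system> = <1, 1, 0>
Q = <information, system, index> = <1, 1, 1>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$
Based on the results of the cosine similarity, the order in which the documents would appear in a
search engine is:
1) Document 3
2) Document 1
3) Document 2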
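The cosine similarity ranking can be checked with a short Python sketch. It assumes raw term frequencies from the frequency table as vector weights (no tf-idf weighting) and the component order (information, system, index); `cosine` and `rank` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Component order: (information, system, index).
QUERY = (1, 1, 1)
DOC_VECTORS = {
    1: (3, 0, 1),  # information x3, index x1
    2: (2, 0, 0),  # information x2
    3: (1, 1, 0),  # information x1, system x1
}


def rank(doc_vectors, query):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = {doc_id: cosine(vec, query) for doc_id, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for doc_id, score in rank(DOC_VECTORS, QUERY):
    print(f"Document {doc_id}: {score:.2f}")
```

Because a cosine similarity of non-negative vectors always lies between 0 and 1, any score above 1 signals an arithmetic slip in the denominator.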
Question 2 IR evaluation
A Search engines
Google
Bing
B Targets
Target 5: Obtain the price of a new Xbox One.
Search queries
Query 1= Xbox one price
Query 2= Price xbox one
Google search Engine
Figure 1: Google Search Engine
Bing Search Engine
Figure 2: Bing search engine
C Average for Google and Bing
Figure 3: Comparison by average
Based on the chart in Figure 3, Google outperforms Bing on both precision and recall. Google's higher
recall means that it retrieves a larger share of the relevant results than Bing, and its higher
precision means that a larger share of the results it returns is relevant to the search query.
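Precision and recall, and their average over several queries, can be computed as in the Python sketch below. The result identifiers and relevance judgments are entirely hypothetical placeholders used to show the arithmetic; they are not the measurements behind Figure 3, and `precision_recall` and `average_scores` are illustrative names.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_scores(per_query):
    """Average a list of (precision, recall) pairs across queries."""
    n = len(per_query)
    avg_p = sum(p for p, _ in per_query) / n
    avg_r = sum(r for _, r in per_query) / n
    return avg_p, avg_r


# Hypothetical judgments: pages "a".."d" are relevant to the search target.
RELEVANT = {"a", "b", "c", "d"}
query1 = precision_recall(["a", "b", "x", "y", "c"], RELEVANT)  # 3 of 5 relevant
query2 = precision_recall(["a", "x"], RELEVANT)                 # 1 of 2 relevant
print(average_scores([query1, query2]))
```

Running the same function on the judged result lists of each engine for both queries gives the per-engine averages that a comparison chart would plot.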
Bibliography
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 27 May 2018].