SIT772 Database and Information Retrieval: Inverted Index Report
Added on 2023/06/12
AI Summary
This report details the creation of an inverted index from a set of documents, including stop word removal, Porter stemming, and merging the documents. It covers the generation of a dictionary and posting file, followed by testing using Boolean and vector queries. The report also includes an information retrieval evaluation comparing the Google and Bing search engines on precision and recall for a specific search target, with a comparative analysis of their performance.

COVER PAGE (ENTER YOUR DETAILS)

Contents
COVER PAGE (ENTER YOUR DETAILS)
Question 1
Creating an inverted index
A Remove stop words
Porter Stemming algorithm
B The three documents merged
Sorting words in alphabetical order for all documents
Word frequency for every document
C Dictionary and related posting file
D Testing
Boolean and vector queries
Question 2 IR evaluation
A Search engines
B Targets
Search queries
Google search Engine
Bing Search Engine
C Average for Google and Bing
Bibliography

Question 1
Creating an inverted index
Document 1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
Document 2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections
Document 3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
A Remove stop words
Results
Document 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
Document 2
Information retrieval finding material unstructured nature satisfies information within large
collections
Document 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
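The stop word removal step can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for the report: the `STOP_WORDS` set is an assumed list, chosen only to be large enough for the three sample documents, and `remove_stop_words` is a hypothetical helper name.

```python
import re

# Assumed stop list: just large enough to cover the three sample documents.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need",
    "of", "on", "or", "other", "that", "the", "to", "use",
}

def remove_stop_words(text):
    """Return the words of `text`, in order, with stop words dropped."""
    # Words may contain internal hyphens (e.g. "full-text"); punctuation is discarded.
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

document_2 = (
    "Information retrieval is finding material of an unstructured nature "
    "that satisfies an information need from within large collections"
)
print(" ".join(remove_stop_words(document_2)))
```

With this stop list, the printed line for Document 2 is the same as the Document 2 result listed above. A real system would normally use a published stop list rather than one tailored to the collection.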
Porter Stemming algorithm
Results
Document 1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
Document 2
Information retrieve find material unstructure nature satisfy information within large collect
Document 3
Information system study complement network hardware software people organ collect filter
process create distribute data

B The three documents merged
Term Document
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3

Sorting words in alphabetical order for all documents
Term Doc ID
Active 1
Base 1
Base 1
collect 2
Complement 3
Content 1
Find 2
Full 1
Hardware 3
index 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
People 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
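The merge and sort steps can be sketched as below. The sketch assumes the stemmed text from the Porter stemming results, lower-cased, as its input; `STEMMED_DOCS` and `term_doc_pairs` are hypothetical names. Because it is derived directly from that text, its output is not guaranteed to match the table row for row.

```python
# Stemmed documents, copied from the stemming results and lower-cased (assumption).
STEMMED_DOCS = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy information "
       "within large collect",
    3: "information system study complement network hardware software people organ "
       "collect filter process create distribute data",
}

def term_doc_pairs(docs):
    """Emit one (term, doc_id) pair per token, in document order."""
    return [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Merged list: every token of every document, tagged with its document ID.
pairs = term_doc_pairs(STEMMED_DOCS)

# Tuples compare element by element, so this sorts by term, then by document ID.
sorted_pairs = sorted(pairs)

for term, doc_id in sorted_pairs[:5]:
    print(term, doc_id)
```

Sorting on the (term, document ID) tuple keeps all occurrences of a term adjacent and in document order, which is what the later frequency count relies on.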

Word frequency for every document
Term Doc ID Frequency
Active 1 1
Base 1 2
collect 2 1
Complement 3 1
Content 1 1
Find 2 1
Full 1 1
Hardware 3 1
index 1 1
Information 1 3
Information 2 2
Information 3 1
Large 2 1
Material 2 1
Nature 2 1
Network 3 1
Obtain 1 1
People 3 1
Relevant 1 1
Resource 1 2
Retrieve 1 1
Retrieve 2 1
Satisfy 2 1
Search 1 1
Software 3 1
Study 3 1
System 3 1
Text 1 1
Unstructure 2 1
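Counting term frequency per document amounts to collapsing repeated (term, document) pairs into a single row with a count. A minimal sketch, again assuming the lower-cased stemmed text as input (`STEMMED_DOCS` and `term_frequencies` are hypothetical names):

```python
from collections import Counter

# Stemmed documents, copied from the stemming results and lower-cased (assumption).
STEMMED_DOCS = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy information "
       "within large collect",
    3: "information system study complement network hardware software people organ "
       "collect filter process create distribute data",
}

def term_frequencies(docs):
    """Count how often each term occurs in each document."""
    counts = Counter()
    for doc_id, text in docs.items():
        for term in text.split():
            counts[(term, doc_id)] += 1
    return counts

freq = term_frequencies(STEMMED_DOCS)

# Print rows in the same Term / Doc ID / Frequency layout as the table.
for (term, doc_id), n in sorted(freq.items()):
    print(f"{term:<12} {doc_id} {n}")
```

For example, `freq[("information", 1)]` is 3 and `freq[("base", 1)]` is 2, in line with the corresponding rows of the table.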

C Dictionary and related posting file
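One way to derive a dictionary and posting file from the stemmed documents is sketched below. The layout is an assumption: the dictionary holds, per term, the number of documents containing it and its total frequency, and the posting file holds, per term, a list of (document ID, term frequency) entries. The names `build_index`, `doc_freq` and `total_freq` are hypothetical.

```python
from collections import Counter, defaultdict

# Stemmed documents, copied from the stemming results and lower-cased (assumption).
STEMMED_DOCS = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy information "
       "within large collect",
    3: "information system study complement network hardware software people organ "
       "collect filter process create distribute data",
}

def build_index(docs):
    """Return (dictionary, postings) for a {doc_id: text} collection."""
    postings = defaultdict(list)
    # Visit documents in ID order so each posting list is sorted by document ID.
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id].split()).items():
            postings[term].append((doc_id, tf))
    dictionary = {
        term: {"doc_freq": len(plist), "total_freq": sum(tf for _, tf in plist)}
        for term, plist in sorted(postings.items())
    }
    return dictionary, dict(postings)

dictionary, postings = build_index(STEMMED_DOCS)
print(dictionary["information"], postings["information"])
```

Keeping the posting lists sorted by document ID is what makes Boolean AND and OR queries cheap to evaluate by merging lists.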

D Testing
The inverted index was tested using the Google search engine, and the results it returned included all three documents. The keywords used for the search were information, system and index.
Boolean and vector queries
e. Boolean queries
1) information AND system AND index = no documents (system occurs only in Doc3, index only in Doc1)
2) system AND index = no documents
3) information AND index = Doc1
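Boolean queries over an inverted index reduce to set operations on posting lists: AND is an intersection and OR is a union. The sketch below assumes posting sets read off the term tables (information in documents 1, 2 and 3; system in document 3; index in document 1); `POSTINGS`, `boolean_and` and `boolean_or` are hypothetical names.

```python
# term -> set of document IDs containing it (assumed from the term tables).
POSTINGS = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

def boolean_and(*terms):
    """Documents containing every term (intersection of posting sets)."""
    sets = [POSTINGS.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Documents containing at least one term (union of posting sets)."""
    return set().union(*(POSTINGS.get(t.lower(), set()) for t in terms))

queries = [
    ("information", "system", "index"),
    ("system", "index"),
    ("information", "index"),
]
for q in queries:
    print(" AND ".join(q), "=", sorted(boolean_and(*q)))
    print(" OR ".join(q), "=", sorted(boolean_or(*q)))
```

Under these posting sets, only the third query has a non-empty AND result, namely document 1, while the OR forms return every document that contains at least one of the terms.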
f. Vector queries
Cosine similarity is computed between each document vector and the query vector. Each vector holds the term frequencies of the query terms, in the order (information, system, index):
\[ \sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert} \]
Query Q = <information, system, index> = <1, 1, 1>
Document 1
D1 = <3, 0, 1> (information occurs three times and index once; system does not occur)
\[ \sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73 \]
Document 2
D2 = <2, 0, 0> (information occurs twice)
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58 \]
Document 3
D3 = <1, 1, 0> (information and system each occur once)
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 \]

Based on the cosine similarity scores, the order in which a search engine would rank the documents is:
1) Document 3
2) Document 1
3) Document 2
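The cosine similarity calculation can be checked with a short script. This sketch assumes raw term-frequency vectors over the query terms (information, system, index), with counts taken from the word frequency table, and no IDF weighting; `DOC_VECTORS`, `QUERY` and `cosine` are hypothetical names.

```python
import math

# Term-frequency vectors in the order (information, system, index),
# assumed from the word frequency table.
DOC_VECTORS = {
    1: (3, 0, 1),
    2: (2, 0, 0),
    3: (1, 1, 0),
}
QUERY = (1, 1, 1)

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

scores = {doc_id: cosine(vec, QUERY) for doc_id, vec in DOC_VECTORS.items()}

# Rank documents from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(f"Document {doc_id}: {scores[doc_id]:.2f}")
```

Because all components are non-negative, every score lies between 0 and 1. Document 3 scores highest here because it matches two of the three query terms with a short vector, whereas Document 1's score is pulled down by the weight of its repeated "information" term.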
Question 2 IR evaluation
A Search engines
Google
Bing
B Targets
Target 5: Obtain the price of a new Xbox One.
Search queries
Query 1= Xbox one price
Query 2= Price xbox one
Google search Engine
Figure 1: Google Search Engine

Bing Search Engine
Figure 2: Bing search engine

C Average for Google and Bing
Figure 3: Comparison by average
Based on the chart in Figure 3, Google outperforms Bing on both precision and recall. Google's higher recall means that it retrieved a larger share of the relevant results than Bing did. Google's higher precision means that a larger share of the results it returned were relevant to the search query.
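Precision and recall, and their average over the two queries, can be computed as in the sketch below. The result identifiers and relevance judgements are invented purely for illustration and are not the values behind Figure 3; `precision_recall` and `mean` are hypothetical helper names.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)

# Illustrative data only: made-up result IDs for two queries on one engine.
relevant = {"r1", "r2", "r3", "r4"}
query_results = {
    "Xbox one price": ["r1", "r2", "x1", "x2"],
    "Price xbox one": ["r1", "r3", "r4", "x3"],
}

per_query = {q: precision_recall(res, relevant) for q, res in query_results.items()}
avg_precision = mean(p for p, _ in per_query.values())
avg_recall = mean(r for _, r in per_query.values())
print(per_query, avg_precision, avg_recall)
```

Recall needs the full set of relevant results, which is rarely known for a web search engine, so in practice it is usually estimated against a pooled set of relevant results gathered from all the engines being compared.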
Bibliography
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 27 May 2018].