SIT772 Database and Information Retrieval - Indexing and Queries
Added on 2023/06/15
This report delves into the creation and testing of an inverted index using three topics: Science, Computer Vision, and Search Engines. It details the steps of stop word removal, Porter stemming, and the construction of a dictionary and posting file. The report further explores boolean and vector querie...

COVER PAGE

Contents
Question 1
Inverted index
Document for each topic
a. Stop words removal
Applying Porter Stemming algorithm
Steps followed to create the inverted index
Step 1: Normalized tokens sorted in alphabetical order
Step 2: Merge terms appearing more than once
b. Dictionary and related posting file
c. Testing
2 Boolean and vector queries
a. Boolean queries
b. Vector model (cosine similarity)
Question 2
Search engines
My Target
Search queries
Google Search engine
Bing search engine
Average for Google and Bing
Bibliography

Question 1
Inverted index
The three selected topics are:
Science
Computer vision
Search Engine
Document for each topic
Science – DOC1
The Union of Concerned Scientists puts rigorous, independent science to work to solve our
planet's most pressing problems
Computer vision – DOC2
Computer vision systems are implemented in a wide range of industrial and scientific
applications
Search engine – DOC3
Companies that advertise on regular search engines like Google have to pay a lot of money per
click for the keyword that is related to their product or service
Creating the inverted index
a. Stop words removal
Science – DOC1
Union Concerned Scientists puts rigorous, independent science solve planet's pressing
problems
Computer vision – DOC2
Computer vision systems implemented wide range industrial scientific applications
Search engine – DOC3
Companies advertise regular search engines like Google pay money per click keyword related
product service
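The stop word step can be sketched in a few lines of Python. This is only an illustration: the report does not say which stop list it used, so the `STOP_WORDS` set and the `remove_stop_words` helper below are assumptions chosen to reproduce the DOC2 result above.

```python
import re

# Illustrative stop list (an assumption; the report does not name the list it used).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "or", "that", "the", "to"}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


doc2 = ("Computer vision systems are implemented in a wide range "
        "of industrial and scientific applications")
print(remove_stop_words(doc2))
# ['computer', 'vision', 'systems', 'implemented', 'wide', 'range',
#  'industrial', 'scientific', 'applications']
```

A larger stop list would be needed to reproduce DOC1 and DOC3, where words such as "our", "most" and "their" are also dropped.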
Applying Porter Stemming algorithm
Science – DOC1
Union Concern Science put rigor depend science solve planet press problem
Computer vision – DOC2
Compute vision system implement wide range industry science apply
Search engine – DOC3
Company advertise regular search engine like Google pay money per click keyword relate
product service

Steps followed to create the inverted index
Step 1: Normalized tokens sorted in alphabetical order
Term         Doc ID
advertise    3
apply        2
click        3
company      3
compute      2
concern      1
depend       1
engine       3
google       3
implement    2
industry     2
keyword      3
like         3
money        3
pay          3
per          3
planet       1
press        1
problem      1
product      3
put          1
range        2
regular      3
relate       3
rigor        1
science      1
science      1
science      2
search       3
service      3
solve        1
system       2
union        1
vision       2
wide         2
Step 2: Merge terms appearing more than once
Term         Frequency    Doc ID
advertise    1            3
apply        1            2
click        1            3
company      1            3
compute      1            2
concern      1            1
depend       1            1
engine       1            3
google       1            3
implement    1            2
industry     1            2
keyword      1            3
like         1            3
money        1            3
pay          1            3
per          1            3
planet       1            1
press        1            1
problem      1            1
product      1            3
put          1            1
range        1            2
regular      1            3
relate       1            3
rigor        1            1
science      2            1
science      1            2
search       1            3
service      1            3
solve        1            1
system       1            2
union        1            1
vision       1            2
wide         1            2

b. Dictionary and related posting file
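As a rough sketch, Steps 1 and 2 and the resulting dictionary and posting file can be reproduced in Python from the stemmed documents. The `build_index` function and the layout of its two return values are assumptions made for illustration, not the report's own file format: here the dictionary maps each term to its document frequency, and the postings map each term to a list of (doc ID, term frequency) pairs.

```python
from collections import Counter, defaultdict

# Stemmed tokens per document, copied from the report and lowercased.
DOCS = {
    1: "union concern science put rigor depend science solve planet press problem",
    2: "compute vision system implement wide range industry science apply",
    3: ("company advertise regular search engine like google pay money per "
        "click keyword relate product service"),
}


def build_index(docs):
    """Return (dictionary, postings) for a {doc_id: text} mapping."""
    # Step 1: (term, doc_id) pairs sorted alphabetically, then by doc ID.
    pairs = sorted(
        (term, doc_id) for doc_id, text in docs.items() for term in text.split()
    )
    # Step 2: merge repeated pairs into a per-document term frequency.
    counts = Counter(pairs)
    postings = defaultdict(list)
    for (term, doc_id), tf in sorted(counts.items()):
        postings[term].append((doc_id, tf))
    # Dictionary: term -> number of documents containing the term.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)


dictionary, postings = build_index(DOCS)
print(dictionary["science"])  # 2
print(postings["science"])    # [(1, 2), (2, 1)]
```

The posting list for "science" matches the merged table in Step 2: frequency 2 in document 1 and frequency 1 in document 2.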

c. Testing
Testing with the inverted index returned documents related to each of the three topics. Testing was
also done using Google, where most of the results were related to the three topics.
2 Boolean and vector queries
a. Boolean queries
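Query 1: (Science ∧ System ∧ ¬Advertise)
Returns: DOC2 only. Science occurs in DOC1 and DOC2, but System occurs only in DOC2.
Query 2: ((Science ∨ Sceince) ∧ Advertise)
Returns: no documents. Science occurs in DOC1 and DOC2 (the variant "Sceince" is not in the dictionary), while Advertise occurs only in DOC3, so the intersection is empty.
Query 3: (product ∧ search ∧ relate ∧ (¬website ∨ ¬vision))
Returns: DOC3 only.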
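The three Boolean queries can be checked with set operations over the posting lists. The sketch below is illustrative only: the `POSTINGS` mapping is copied by hand from the index in this report, and the helper names `docs_with` and `docs_without` are made up for the example.

```python
# Postings for the query terms, taken from the index in this report.
POSTINGS = {
    "science": {1, 2},
    "system": {2},
    "advertise": {3},
    "product": {3},
    "search": {3},
    "relate": {3},
    "vision": {2},
}
ALL_DOCS = {1, 2, 3}


def docs_with(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return POSTINGS.get(term, set())


def docs_without(term):
    """Documents NOT containing the term (Boolean negation)."""
    return ALL_DOCS - docs_with(term)


# AND is set intersection (&), OR is set union (|).
q1 = docs_with("science") & docs_with("system") & docs_without("advertise")
q2 = (docs_with("science") | docs_with("sceince")) & docs_with("advertise")
q3 = (docs_with("product") & docs_with("search") & docs_with("relate")
      & (docs_without("website") | docs_without("vision")))

print(q1)  # {2}
print(q2)  # set()
print(q3)  # {3}
```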
b. Vector model (cosine similarity)
Query: (science, industry, search, search)
The query and each document are represented over three dimensions: (science, industry, search)
Calculation of cosine similarity
For D1
D1 = <1, 0, 0>
Q = <science, industry, search, search> = <1, 1, 2>
$$\sigma(D_1, Q) = \frac{1 \times 1 + 0 \times 1 + 0 \times 2}{\sqrt{1^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{1}{\sqrt{1}\,\sqrt{6}} = 0.41$$
Thus D1 = 0.41
For D2
D2 = <1, 1, 0>
Q = <science, industry, search, search> = <1, 1, 2>
$$\sigma(D_2, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 2}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{2}{\sqrt{2}\,\sqrt{6}} = 0.58$$
Thus D2 = 0.58

For D3
D3 = <0, 0, 1>, since "search" is the only query term that occurs in DOC3
Q = <science, industry, search, search> = <1, 1, 2>
$$\sigma(D_3, Q) = \frac{0 \times 1 + 0 \times 1 + 1 \times 2}{\sqrt{0^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{2}{\sqrt{1}\,\sqrt{6}} = 0.82$$
Thus D3 = 0.82
According to the cosine similarity of each document with the query, the documents are
returned in the following order when the query is run:
D3, D2, D1
The vector space model with cosine similarity is more effective than the Boolean model at
showing which documents a search engine fetches. The Boolean model only states whether a
document matches, whereas the vector space model also gives the order in which the documents
are fetched: the document with the highest cosine similarity is fetched first.
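The cosine similarity ranking can be recomputed with a short Python sketch. It makes two assumptions that the report does not state explicitly: vectors hold raw term counts, and both the documents and the query are restricted to the three query dimensions. The names `vectorize` and `cosine` are illustrative.

```python
import math
from collections import Counter

DIMENSIONS = ("science", "industry", "search")

# Stemmed documents from the report, lowercased.
DOCS = {
    "D1": "union concern science put rigor depend science solve planet press problem",
    "D2": "compute vision system implement wide range industry science apply",
    "D3": ("company advertise regular search engine like google pay money per "
           "click keyword relate product service"),
}
QUERY = "science industry search search"


def vectorize(text):
    """Raw term counts of the text along the three query dimensions."""
    counts = Counter(text.split())
    return [counts[d] for d in DIMENSIONS]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


q = vectorize(QUERY)
scores = {name: cosine(vectorize(text), q) for name, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 2))
# D3 0.82
# D2 0.58
# D1 0.41
```

With raw counts D1 becomes <2, 0, 0> rather than <1, 0, 0>, but the cosine is unchanged because normalization removes the vector length.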
Question 2
Search engines
Google
Bing
My Target
Obtain the installation document for MongoDB
Search queries
Query 1 = MongoDB install document
Query 2 = mongoDB (installation, setup) document

Google Search engine
Figure 1: Google Search Engine

Bing search engine
Figure 2: Bing search engine

Average for Google and Bing
Figure 3: Comparison by average
Evaluation
Figure 3 compares the two search engines by their average precision and recall over the two queries.
According to the chart, Google beats Bing on both precision and recall. This means that, for both
queries, Google is superior to Bing: it is more precise and has a higher recall value when the same
queries are run on both search engines.
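For reference, precision and recall per query, and their mean over the queries, can be computed as in the sketch below. The result IDs and relevance judgments are hypothetical placeholders, not the data behind Figures 1 to 3, which are not reproduced in this text.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one engine (NOT the report's data).
relevant = {"r1", "r2", "r3", "r4"}
runs = {
    "query 1": ["r1", "r2", "x1", "r3", "x2"],
    "query 2": ["r1", "x1", "x2", "x3", "r4"],
}

results = {name: precision_recall(hits, relevant) for name, hits in runs.items()}
mean_precision = sum(p for p, _ in results.values()) / len(results)
mean_recall = sum(r for _, r in results.values()) / len(results)

print(results["query 1"])           # (0.6, 0.75)
print(results["query 2"])           # (0.4, 0.5)
print(mean_precision, mean_recall)  # 0.5 0.625
```

Running the same calculation for each engine gives the two pairs of averages that a chart like Figure 3 compares.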
Bibliography
Manning, C., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval. Cambridge:
Cambridge University Press.