Inverted Index, Boolean Queries, and Search Engine Analysis Project


Added on 2021/05/30

AI Summary

This assignment solution provides a comprehensive overview of inverted indexes, boolean queries, and search engine evaluation. The solution begins by constructing an inverted index using stop word removal, Porter stemming, and tokenization steps. It then demonstrates boolean and vector queries using cosine similarity to rank documents. The assignment proceeds to evaluate search engines (Google and Yahoo) based on precision and recall metrics, comparing their performance using specific search queries. The solution includes detailed steps, examples, and visual representations (figures) to illustrate the concepts and results. This assignment is a valuable resource for students studying information retrieval and search engine optimization, offering practical insights into indexing techniques and query processing.

COVER PAGE
DETAILS


Contents
COVER PAGE
Question 1
Inverted index
Document for each topic
a. Stop words removal
Applying Porter Stemming algorithm
Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Step 2: Normalized tokens sorted in alphabetical order
Step 3: Merge terms appearing more than once
b. Dictionary and related postings file
c. Testing
2 Boolean and vector queries
Question 2
Search engines
Target
Search queries
Bibliography

Question 1
Inverted index
Document for each topic
• DOC1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
• DOC2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
• DOC3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The following steps are followed to create an inverted index.
a. Stop words removal
After removing the stop words, the documents become:
• DOC1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
• DOC2
Information retrieval finding material unstructured nature satisfies information within large
collections
• DOC3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
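The stop word removal above can be sketched in a few lines of Python. This is a minimal illustration, not the method actually used for the assignment: the stop word list `STOP_WORDS` is an assumption that contains only the words the worked example drops from DOC2, and the helper name `remove_stop_words` is hypothetical.

```python
import re

# Assumed stop word list: only the words the worked example removes from DOC2.
STOP_WORDS = {"is", "of", "an", "that", "need", "from"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split text into word tokens and drop stop words (case-insensitive)."""
    # Keep hyphenated words such as "full-text" together as one token.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in stop_words]

doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")
filtered = remove_stop_words(doc2)
print(" ".join(filtered))
```

A production system would normally use a larger, published stop word list rather than one tailored to a single document.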
Applying Porter Stemming algorithm
The next step applies the Porter stemming algorithm, after which the documents become:
• DOC1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
• DOC2
Information retrieve find material unstructure nature satisfy information within large collect
• DOC3
Information system study complement network hardware software people organ collect filter
process create distribute data

Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
collect 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
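Step 1 can be sketched as follows. The stemmed token strings are copied from the stemmed documents above and lowercased for normalization. The function name `term_doc_pairs` is hypothetical, and splitting on whitespace is an assumption that works here only because the text has already been cleaned.

```python
# Stemmed documents copied from the worked example (lowercased).
stemmed_docs = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy "
       "information within large collect",
    3: "information system study complement network hardware software people "
       "organ collect filter process create distribute data",
}

def term_doc_pairs(docs):
    """Return (term, doc_id) pairs in document order, one pair per token."""
    pairs = []
    for doc_id in sorted(docs):
        for token in docs[doc_id].lower().split():
            pairs.append((token, doc_id))
    return pairs

pairs = term_doc_pairs(stemmed_docs)
for term, doc_id in pairs[:5]:
    print(term, doc_id)
```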
Step 2: Normalized tokens sorted in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
collect 1
collect 2
collect 3
Complement 3
Content 1
create 3
data 3
Distribute 3
Filter 3
Find 2
Full 1
Hardware 3
index 1
Inform 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
Organ 3
People 3
Process 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Within 2
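The sort in Step 2 orders the pairs by term first and by document ID second. A minimal sketch on a handful of pairs from the example is shown below. It relies on Python comparing tuples element by element, and it assumes the terms are already lowercased so that the ordering is purely alphabetical.

```python
# A few (term, doc_id) pairs in document order, taken from the example.
pairs = [
    ("information", 1), ("retrieve", 1), ("index", 1),
    ("information", 2), ("find", 2), ("collect", 2),
    ("information", 3), ("system", 3), ("collect", 3),
]

# Tuples compare element by element: by term, then by doc_id for equal terms.
sorted_pairs = sorted(pairs)

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```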

Step 3: Merge terms appearing more than once
Term Frequency Doc ID
Active 1 1
Base 2 1
collect 1 1
collect 1 2
collect 1 3
Complement 1 3
Content 1 1
create 1 3
data 1 3
Distribute 1 3
Filter 1 3
Find 1 2
Full 1 1
Hardware 1 3
index 1 1
Inform 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
Organ 1 3
People 1 3
Process 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
Within 1 2
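The merge in Step 3 collapses repeated (term, document) pairs into one row carrying a frequency. One way to sketch it is with `collections.Counter`. The helper name `merge_pairs` is hypothetical and the input is a small subset of the sorted pairs.

```python
from collections import Counter

# A subset of the sorted (term, doc_id) pairs from Step 2.
sorted_pairs = [
    ("base", 1), ("base", 1),
    ("information", 1), ("information", 1), ("information", 1),
    ("information", 2), ("information", 2),
    ("information", 3),
    ("system", 3),
]

def merge_pairs(pairs):
    """Collapse duplicate (term, doc_id) pairs into (term, frequency, doc_id) rows."""
    counts = Counter(pairs)
    return [(term, freq, doc_id) for (term, doc_id), freq in sorted(counts.items())]

merged = merge_pairs(sorted_pairs)
for row in merged:
    print(*row)
```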

b. Dictionary and related postings file
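As a rough illustration of how a dictionary and postings file relate, the sketch below splits merged rows into a dictionary (term to document frequency) and postings lists (term to the documents containing it, with the term frequency in each). The rows are a subset of the Step 3 table. The layout and the name `build_index` are assumptions, not the structure submitted for the assignment.

```python
from collections import defaultdict

# (term, frequency, doc_id) rows taken from the Step 3 table.
merged = [
    ("index", 1, 1),
    ("information", 3, 1), ("information", 2, 2), ("information", 1, 3),
    ("retrieve", 1, 1), ("retrieve", 1, 2),
    ("system", 1, 3),
]

def build_index(rows):
    """Return (dictionary, postings) built from merged rows."""
    postings = defaultdict(list)
    for term, freq, doc_id in rows:
        postings[term].append((doc_id, freq))
    # Dictionary entry: number of documents that contain the term.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_index(merged)
for term in sorted(dictionary):
    print(term, dictionary[term], postings[term])
```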


c. Testing
The inverted index was tested using the Google search engine: the results returned by the
search engine related to the three topics.
2 Boolean and vector queries
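a. Boolean queries
From the inverted index, the postings are information → {DOC1, DOC2, DOC3}, system → {DOC3}
and index → {DOC1}.
i) (information ∧ system ∧ index)
Result: no document is returned, because no document contains all three terms.
ii) ((information ∨ info) ∧ system ∧ index)
Result: no document is returned. The term info is not in the dictionary, so (information ∨ info)
matches DOC1, DOC2 and DOC3, but no document contains both system and index.
iii) (index ∧ ¬system)
Result: DOC1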
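Boolean queries over an inverted index reduce to set operations on postings. The sketch below evaluates the three queries that way, assuming the postings for information, system and index read off the Step 3 table. The helper `docs` and the variable names are hypothetical.

```python
# Postings (sets of document IDs) read off the Step 3 table.
postings = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

def docs(term):
    """Return the set of documents containing term (empty if term is unknown)."""
    return postings.get(term, set())

# i) information AND system AND index
q1 = docs("information") & docs("system") & docs("index")
# ii) (information OR info) AND system AND index
q2 = (docs("information") | docs("info")) & docs("system") & docs("index")
# iii) index AND NOT system
q3 = docs("index") - docs("system")

for name, result in (("i", q1), ("ii", q2), ("iii", q3)):
    print(name, sorted(result))
```

AND maps to set intersection, OR to union, and AND NOT to set difference, so no document text needs to be scanned at query time.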
b. Vector model using cosine similarity
Given the query (information, information, system, index), each document is represented by its
counts of the query terms (information, system, index).
This gives three dimensions (information, system, index), the axes along which the query and
each document are compared. The query vector is Q = <2, 1, 1>.
Cosine similarity:
\[ \sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert} \]
For D1
D1 = <information, information, information, index> = <3, 0, 1>
Q = <information, information, system, index> = <2, 1, 1>
\[ \sigma(D_1, Q) = \frac{3 \times 2 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{7}{\sqrt{10}\,\sqrt{6}} \approx 0.90 \]
For D2
D2 = <information, information> = <2, 0, 0>
Q = <information, information, system, index> = <2, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 2 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{4}\,\sqrt{6}} \approx 0.82 \]
For D3
D3 = <information, system> = <1, 1, 0>
Q = <information, information, system, index> = <2, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 2 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{3}{\sqrt{2}\,\sqrt{6}} \approx 0.87 \]
Based on these results, the documents would be ranked DOC1 first, then DOC3, and finally DOC2.
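The cosine computation can be checked with a short script. This is a minimal sketch under stated assumptions: vector components are raw term counts for the three query terms taken from the Step 3 table, no tf-idf weighting is applied, and the names `cosine`, `to_vector` and `doc_counts` are hypothetical.

```python
import math
from collections import Counter

TERMS = ["information", "system", "index"]  # the three dimensions

# Counts of the query terms per document, read off the Step 3 table.
doc_counts = {
    "DOC1": {"information": 3, "index": 1},
    "DOC2": {"information": 2},
    "DOC3": {"information": 1, "system": 1},
}
query = ["information", "information", "system", "index"]

def to_vector(counts):
    """Project term counts onto the fixed TERMS dimensions."""
    return [counts.get(term, 0) for term in TERMS]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

q_vec = to_vector(Counter(query))
scores = {name: cosine(to_vector(c), q_vec) for name, c in doc_counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 2))
```

For non-negative count vectors the cosine always lies between 0 and 1, which is a quick sanity check on any hand calculation.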
Question 2 IR evaluation
Target and designed queries
Search engines
• Google
• Yahoo
Target
Target 1: obtain the course information for S779.
Target 4: obtain the Oracle SQL tutorial.
Search queries
• Query 1 = ST779 Course information
• Query 2 = Oracle SQL tutorial
a. Google search engine precision vs recall for query 1, query 2 and the average
Figure 1: Google search engine

b. Yahoo search engine precision vs recall for query 1, query 2 and the average
Figure 2: Yahoo search engine
c. Average for Google and Yahoo
Figure 3: Comparison by average
Google is the superior search engine: it has higher recall for both queries and higher precision,
meaning its results are more relevant to the query than those returned by Yahoo.
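Precision is the fraction of retrieved results that are relevant, and recall is the fraction of all relevant documents that have been retrieved. The sketch below shows how the points of a precision vs recall curve could be computed from a ranked result list. The relevance judgments and the total number of relevant documents are hypothetical values chosen for illustration, not the data behind the figures above.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return (precision, recall) after each rank of a result list.

    ranked_relevance: list of booleans, True where the result at that rank
    is relevant. total_relevant: number of relevant documents that exist.
    """
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / k, hits / total_relevant))
    return points

# Hypothetical judgments for the top five results of one query.
judgments = [True, False, True, True, False]
points = precision_recall_points(judgments, total_relevant=4)

for rank, (precision, recall) in enumerate(points, start=1):
    print(rank, round(precision, 2), round(recall, 2))
```

Averaging such curves over several queries gives one curve per search engine, which is the kind of comparison the averaged figure summarizes.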
Bibliography
Lewandowski, D. (2009). The Retrieval Effectiveness of Web Search Engines: Considering Results
Descriptions. Information retrieval, [online] 1(1). Available at: https://arxiv.org/abs/1511.05800
[Accessed 20 May 2018].