Inverted Index, Boolean Queries, and Search Engine Analysis Project

Added on  2021/05/30

Homework Assignment
AI Summary
This assignment solution provides a comprehensive overview of inverted indexes, boolean queries, and search engine evaluation. The solution begins by constructing an inverted index using stop word removal, Porter stemming, and tokenization steps. It then demonstrates boolean and vector queries using cosine similarity to rank documents. The assignment proceeds to evaluate search engines (Google and Yahoo) based on precision and recall metrics, comparing their performance using specific search queries. The solution includes detailed steps, examples, and visual representations (figures) to illustrate the concepts and results. This assignment is a valuable resource for students studying information retrieval and search engine optimization, offering practical insights into indexing techniques and query processing.
Document Page
COVER PAGE
DETAILS
Contents
COVER PAGE
Question 1
Inverted index
Document for each topic
a. Stop words removal
Applying Porter Stemming algorithm
Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Step 2: Normalized tokens sorted in alphabetical order
Step 3: Merge terms appearing more than once
b. Dictionary and related posting file
c. Testing
2 Boolean and vector queries
Question 2
Search engines
Target
Search queries
Bibliography
Question 1
Inverted index
Document for each topic
DOC1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
DOC2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections
DOC3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The inverted index is created in the following steps.
a. Stop words removal
After removing the stop words, the documents become:
DOC1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
DOC2
Information retrieval finding material unstructured nature satisfies information within large
collections
DOC3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
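The stop word removal above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure the assignment necessarily used: the `STOP_WORDS` set and the helper names `tokenize` and `remove_stop_words` are assumptions, and the stop list was chosen only so that the output agrees with the documents shown above (note that it treats "need" as a stop word, as the documents above do).

```python
import re

# Hypothetical stop list, chosen so the output agrees with the documents above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need", "of", "on",
    "or", "other", "that", "the", "to", "use",
}

def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def remove_stop_words(text):
    """Return the tokens of `text` that are not in STOP_WORDS."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")
print(" ".join(remove_stop_words(doc2)))
# information retrieval finding material unstructured nature satisfies information within large collections
```

The sketch also lowercases every token, which the documents above do not; case folding is a common normalization step but is an addition here.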
Applying Porter Stemming algorithm
Next, the Porter stemming algorithm is applied, and the documents become:
DOC1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
DOC2
Information retrieve find material unstructure nature satisfy information within large collect
DOC3
Information system study complement network hardware software people organ collect filter
process create distribute data
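To give a feel for how suffix stripping works, the sketch below implements only Step 1a of the Porter algorithm, which handles plural endings. It is a partial illustration under that stated limitation, not a full stemmer: the complete algorithm applies several further steps, and it typically yields truncated stems (for example "retriev" rather than "retrieve"), so its output would differ from the readable forms listed above. The function name `porter_step_1a` is an assumption.

```python
def porter_step_1a(word):
    """Apply only Step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]      # resources -> resource
    return word

for w in ["resources", "collections", "networks", "caresses", "ponies"]:
    print(w, "->", porter_step_1a(w))
```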
Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Step 2: Normalized tokens sorted in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
collect 2
Complement 3
Content 1
Find 2
Full 1
Hardware 3
index 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
People 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Within 2
Step 3: Merge terms appearing more than once
Term Frequency Doc ID
Active 1 1
Base 2 1
collect 1 2
Complement 1 3
Content 1 1
Find 1 2
Full 1 1
Hardware 1 3
index 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
People 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
Within 1 2
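Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration that takes the stemmed documents from the previous section as input (lowercased); the variable names are assumptions. Because it is computed directly from the stemmed text, it may list a few postings that the hand-built tables above omit.

```python
from itertools import groupby

# Stemmed documents from the previous section, lowercased.
docs = {
    1: ("information retrieve active obtain inform resource relevant information "
        "collect information resource search base full text content base index"),
    2: ("information retrieve find material unstructure nature satisfy "
        "information within large collect"),
    3: ("information system study complement network hardware software people "
        "organ collect filter process create distribute data"),
}

# Step 1: one (term, doc_id) pair per token occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge identical pairs, recording the term frequency per document.
index = {}
for (term, doc_id), group in groupby(pairs):
    index.setdefault(term, {})[doc_id] = len(list(group))

print(index["information"])  # {1: 3, 2: 2, 3: 1}
print(index["base"])         # {1: 2}
```

Sorting before merging means identical pairs sit next to each other, so a single pass with `groupby` is enough to count them.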
b. Dictionary and related posting file
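One common way to organize a dictionary and a posting file is sketched below, under stated assumptions: the dictionary maps each term to its document frequency and an offset, and the posting file is a flat list of (document ID, term frequency) entries. The layout and all names (`merged`, `dictionary`, `postings_file`, `postings`) are illustrative choices, and `merged` holds only a small subset of the Step 3 table.

```python
# Term -> {doc_id: term frequency}; a small subset of the merged table in Step 3.
merged = {
    "base": {1: 2},
    "information": {1: 3, 2: 2, 3: 1},
    "retrieve": {1: 1, 2: 1},
    "system": {3: 1},
}

postings_file = []   # flat list of (doc_id, term_frequency) entries
dictionary = {}      # term -> (document frequency, offset into postings_file)

for term in sorted(merged):
    entries = sorted(merged[term].items())
    dictionary[term] = (len(entries), len(postings_file))
    postings_file.extend(entries)

def postings(term):
    """Look up a term in the dictionary and read its entries from the posting file."""
    df, offset = dictionary.get(term, (0, 0))
    return postings_file[offset:offset + df]

print(dictionary["information"])  # (3, 1)
print(postings("information"))    # [(1, 3), (2, 2), (3, 1)]
```

Keeping the dictionary small and separate from the postings lets the dictionary stay in memory while the longer posting lists are read only when a term is queried.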
c. Testing
The inverted index was tested using the Google search engine: the results returned by the
search engine were related to all three topics.
2 Boolean and vector queries
a. Boolean queries
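i) (information ∧ system ∧ index)
Result: no document is returned. In the merged index of Step 3, information occurs in DOC1, DOC2
and DOC3, system occurs only in DOC3, and index occurs only in DOC1, so no document contains all
three terms.
ii) ((information ∨ info) ∧ system ∧ index)
Result: no document is returned. The term info does not occur in the index, so
(information ∨ info) matches the same documents as information, and the conjunction with system
and index is again empty.
iii) (index ∧ ¬system)
Result: DOC1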
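Boolean queries over an inverted index reduce to set operations on postings: AND is intersection, OR is union, and NOT is the complement with respect to the whole collection. The sketch below evaluates the three queries this way. The postings are read from the merged table in Step 3, and the helper name `docs` is an assumption.

```python
ALL_DOCS = {1, 2, 3}

# Postings for the query terms, read from the merged table in Step 3.
postings = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

def docs(term):
    """Documents containing `term`; unknown terms such as 'info' match nothing."""
    return postings.get(term.lower(), set())

q1 = docs("information") & docs("system") & docs("index")
q2 = (docs("information") | docs("info")) & docs("system") & docs("index")
q3 = docs("index") & (ALL_DOCS - docs("system"))

print(q1, q2, q3)  # set() set() {1}
```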
b. Vector model using cosine similarity
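The query is (information, information, system, index).
Each document is reduced to the query terms (information, system, index).
This gives a three-dimensional vector space with axes (information, system, index), in which the
query and each document are represented by their term frequencies.
Cosine similarity:
$$\sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert}$$
DOC1
D1 = <information, information, information, index> = <3, 0, 1>
(system does not occur in DOC1; see the merged index in Step 3)
Q = <information, information, system, index> = <2, 1, 1>
$$\sigma(D_1, Q) = \frac{3 \times 2 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{7}{\sqrt{10}\,\sqrt{6}} \approx 0.904$$
DOC2
D2 = <information, information> = <2, 0, 0>
Q = <information, information, system, index> = <2, 1, 1>
$$\sigma(D_2, Q) = \frac{2 \times 2 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{4}\,\sqrt{6}} \approx 0.816$$
DOC3
D3 = <information, system> = <1, 1, 0>
Q = <information, information, system, index> = <2, 1, 1>
$$\sigma(D_3, Q) = \frac{1 \times 2 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{2^2 + 1^2 + 1^2}} = \frac{3}{\sqrt{2}\,\sqrt{6}} \approx 0.866$$
A cosine similarity between non-negative vectors always lies between 0 and 1, and a higher value
means a closer match. Based on these scores, the documents are ranked DOC1 (0.904), then DOC3
(0.866), and finally DOC2 (0.816).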
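The ranking can be checked with a short Python sketch. It assumes term-frequency vectors over the axes (information, system, index), with the counts read from the merged table in Step 3 (so DOC1 has no occurrence of system); the function name `cosine` and the variable names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Axes: (information, system, index). Counts are assumptions read from Step 3.
query = (2, 1, 1)
documents = {"DOC1": (3, 0, 1), "DOC2": (2, 0, 0), "DOC3": (1, 1, 0)}

scores = {name: cosine(vec, query) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
# DOC1 0.904
# DOC3 0.866
# DOC2 0.816
```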
Question 2 IR evaluation
Target and designed queries
Search engines
Google
Yahoo
Target
Target 1: obtain the course information for ST779.
Target 4: obtain the Oracle SQL tutorial.
Search queries
Query 1= ST779 Course information
Query 2= Oracle SQL tutorial
a. Google search engine precision vs recall for query 1, query 2 and the average
Figure 1: Google Search Engine
b. Yahoo search engine precision vs recall for query 1, query 2 and the average
Figure 2: Yahoo search engine
c. Average for Google and Yahoo
Figure 3: Comparison by average
Google is the superior search engine: it has a higher recall value for all queries and higher
precision, meaning that a larger proportion of its returned results are relevant to the query
than Yahoo's.
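The points of a precision vs recall curve can be computed as sketched below. Precision after rank k is the fraction of the top k results that are relevant, and recall is the fraction of all relevant documents found so far. The relevance judgments and the total of 4 relevant documents are hypothetical values for illustration only, not data from the evaluation above, and the function name is an assumption.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return a (recall, precision) pair after each rank of a result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    return points

# Hypothetical judgments for the top 5 results of one query (True = relevant).
judged = [True, True, False, True, False]
points = precision_recall_points(judged, total_relevant=4)

for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Averaging the precision values of several queries at the same recall levels gives an averaged curve of the kind used to compare two engines.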
Bibliography
Lewandowski, D. (2009). The Retrieval Effectiveness of Web Search Engines: Considering Results
Descriptions. Information retrieval, [online] 1(1). Available at: https://arxiv.org/abs/1511.05800
[Accessed 20 May 2018].