Information Retrieval: Inverted Index and Search Engines

Added on 2021/05/31

Contents
Question 1
Creating an inverted index
Stop words removal
Applying Porter Stemming algorithm
Normalized tokens for each document
Merged normalized tokens for the three documents
Sort the normalized tokens in alphabetical order
Create frequencies for tokens per document
Dictionary and related posting file
Testing
Boolean and vector queries
Question 2 IR evaluation
Search engines
Targets
Search queries
Google search Engine
Bing Search Engine
Average for Google and Bing
Bibliography

Question 1
Creating an inverted index
The following documents are used to create the inverted index:
• Document 1 (D1)
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
• Document 2 (D2)
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
• Document 3 (D3)
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The inverted index is built in the following steps:
• Remove stop words
• Apply the Porter stemming algorithm
• Create normalized tokens for each document
• Merge the tokens into one list and sort them in alphabetical order
• Compute the frequency of each token per document
Stop words removal
After removing the stop words, the documents become:
• D1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
• D2
Information retrieval finding material unstructured nature satisfies information within large
collections
• D3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
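The stop word step can be sketched in a few lines of Python. This is only an illustration: the `STOP_WORDS` set below is an assumption, reconstructed from the words that disappear from D1 to D3 in this example, and the function name `remove_stop_words` is hypothetical. A real system would normally load a published stop list.

```python
import re

# Assumed stop list: only the words that were dropped from D1-D3 in this
# example. It is not a standard stop list.
STOP_WORDS = {
    "is", "the", "of", "to", "an", "a", "from", "can", "be", "on", "or",
    "other", "that", "and", "need", "use",
}


def remove_stop_words(text):
    """Split text into word tokens and drop the ones found in STOP_WORDS."""
    # Hyphens are kept so that "full-text" stays a single token at this stage.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


d2 = ("Information retrieval is finding material of an unstructured nature "
      "that satisfies an information need from within large collections")
print(" ".join(remove_stop_words(d2)))
```

Run on D2, the sketch reproduces the stop-word-free text listed for D2.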
Applying Porter Stemming algorithm
After applying the Porter stemming algorithm (stems are written here as readable root words; a strict Porter implementation produces truncated stems such as "retriev" and "activ"):
• D1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
• D2
Information retrieve find material unstructure nature satisfy information within large collect
• D3
Information system study complement network hardware software people organ collect filter
process create distribute data
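To give a feel for how rule-based stemming works, the sketch below implements only step 1a of the Porter algorithm, which handles plural endings. It is deliberately incomplete: the later steps of the algorithm strip further suffixes (for example, reducing "collection" to "collect"), so its output is not the final stem. The function name `porter_step_1a` is hypothetical.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word


for w in ["resources", "collections", "systems", "networks", "organizations"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried from the longest suffix to the shortest, so that "caresses" is matched by the "sses" rule before the plain "s" rule can apply.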
Normalized tokens for each document
• D1
Token Document ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Collect 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
Index 1
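• D2
Token Document ID
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
Collect 2
• D3
Token Document ID
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
Collect 3
Filter 3
Process 3
Create 3
Distribute 3
Data 3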

Merged normalized tokens for the three documents
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Collect 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3

Sort the normalized tokens in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
Collect 1
Collect 2
Collect 3
Complement 3
Content 1
Create 3
Data 3
Distribute 3
Filter 3
Find 2
Full 1
Hardware 3
Index 1
Inform 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
Organ 3
People 3
Process 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Within 2
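The merge and sort steps amount to building one list of (term, document ID) pairs and sorting it. A minimal sketch, assuming the tokens are lowercased and taken from the stemmed text of the three documents; the variable names `docs` and `pairs` are illustrative only.

```python
# Normalized tokens per document, copied from the stemmed text and lowercased.
docs = {
    1: "information retrieve active obtain inform resource relevant "
       "information collect information resource search base full text "
       "content base index",
    2: "information retrieve find material unstructure nature satisfy "
       "information within large collect",
    3: "information system study complement network hardware software "
       "people organ collect filter process create distribute data",
}

# Merge: one (term, doc_id) pair per token occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Sort by term first, then by document ID (tuples compare element by element).
pairs.sort()

for term, doc_id in pairs[:6]:
    print(term, doc_id)
```

Sorting on the pair rather than on the term alone keeps the postings for each term in document order, which later makes intersecting posting lists straightforward.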
Create frequencies for tokens per document
Term Frequency Doc ID
Active 1 1
Base 2 1
Collect 1 1
Collect 1 2
Collect 1 3
Complement 1 3
Content 1 1
Create 1 3
Data 1 3
Distribute 1 3
Filter 1 3
Find 1 2
Full 1 1
Hardware 1 3
Index 1 1
Inform 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
Organ 1 3
People 1 3
Process 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
Within 1 2
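The frequency column is obtained by counting identical (term, document ID) pairs in the sorted list. The sketch below shows this for a small subset of the pairs, using `collections.Counter` from the Python standard library; the subset and the variable names are chosen only for illustration.

```python
from collections import Counter

# Sorted (term, doc_id) pairs for a small subset of the tokens.
pairs = [
    ("base", 1), ("base", 1),
    ("information", 1), ("information", 1), ("information", 1),
    ("information", 2), ("information", 2),
    ("information", 3),
    ("retrieve", 1), ("retrieve", 2),
]

# Counting identical pairs gives the term frequency within each document.
frequencies = Counter(pairs)

for (term, doc_id), tf in sorted(frequencies.items()):
    print(term, tf, doc_id)
```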

Dictionary and related posting file
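A dictionary and its postings can be derived directly from the frequency table. The sketch below shows one possible layout, assumed here for illustration: the dictionary maps each term to its document frequency, and the postings map each term to a list of (document ID, term frequency) pairs. Only a subset of the terms is used, and the names `rows`, `postings` and `dictionary` are hypothetical.

```python
from collections import defaultdict

# (term, term_frequency, doc_id) rows, a subset of the frequency table.
rows = [
    ("index", 1, 1),
    ("information", 3, 1), ("information", 2, 2), ("information", 1, 3),
    ("retrieve", 1, 1), ("retrieve", 1, 2),
    ("system", 1, 3),
]

# postings: term -> list of (doc_id, term_frequency), ordered by doc_id.
postings = defaultdict(list)
for term, tf, doc_id in rows:
    postings[term].append((doc_id, tf))
for plist in postings.values():
    plist.sort()

# dictionary: term -> document frequency (number of documents with the term).
dictionary = {term: len(plist) for term, plist in postings.items()}

for term in sorted(dictionary):
    print(term, dictionary[term], postings[term])
```

Keeping the document frequency in the dictionary lets a query processor start with the rarest term, whose posting list is the shortest.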
Testing
Testing of the inverted index using information, system and index keywords was done using Google. The
results w

Boolean and vector queries
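a. Boolean queries
The posting lists for the query terms are information: D1, D2, D3; system: D3; index: D1.
1) (information AND system AND index)
Result: no documents, since no document contains all three terms
2) (system AND index)
Result: no documents, since system occurs only in D3 and index only in D1
3) (system AND NOT index)
Result: D3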
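Boolean retrieval maps naturally onto set operations: AND is an intersection of posting lists and AND NOT is a set difference. A minimal sketch, assuming the posting lists are held as Python sets of document IDs read off the frequency table (information in D1, D2, D3; system in D3; index in D1):

```python
# Posting lists as sets of document IDs.
postings = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

# 1) information AND system AND index
q1 = postings["information"] & postings["system"] & postings["index"]
# 2) system AND index
q2 = postings["system"] & postings["index"]
# 3) system AND NOT index
q3 = postings["system"] - postings["index"]

print(sorted(q1), sorted(q2), sorted(q3))
```

An empty set means that no document satisfies the query.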
b. Vector model using cosine similarity
Query Q = (information, system, index)
Each document and the query are written as term-frequency vectors over <information, system, index>, using the counts from the frequency table. The cosine similarity is
$$\sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert}$$
Document 1
D1 = <3, 0, 1> (information occurs three times, system does not occur, index occurs once)
Q = <1, 1, 1>
$$\sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73$$
For D2
D2 = <2, 0, 0>
Q = <1, 1, 1>
$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$
For D3
D3 = <1, 1, 0>
Q = <1, 1, 1>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$
Ranked by decreasing cosine similarity, the documents appear in the search results in the order
D3, D1, D2
Question 2 IR evaluation
Search engines
Two of the top search engines were used to perform the IR evaluation:
• Google
• Bing
Targets
Target 2: obtain the price of the new Samsung tablet.
Target 3: obtain the installation manual for Tera Term.
Search queries
• Query 1 = Samsung tablet price
• Query 2 = tera term installation manual

Google search Engine
Figure 1: Google Search Engine

Bing Search Engine
Figure 2: Bing search engine

Average for Google and Bing
Figure 3: Comparison by average
According to Figure 3, which shows the average precision and recall for Google and Bing, Google
has both a higher recall and a higher precision than Bing.
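The averages in Figure 3 rest on per-query precision and recall. As a minimal sketch of how the two measures are computed for one query: precision is the share of returned results that are relevant, and recall is the share of relevant items that were returned. The result IDs and relevance judgments below are hypothetical and are not the values measured in this evaluation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result list."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one query: result IDs 1-10 were returned, and
# the assessor marked these IDs as relevant (11 and 12 were never returned).
retrieved = list(range(1, 11))
relevant = {1, 2, 4, 7, 11, 12}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these per-query values over all queries for each engine gives the kind of comparison plotted in a chart such as Figure 3.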
Bibliography
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge
University Press.