Inverted indexes: Types and techniques
Added on 2021/05/31
Contents
Question 1
  Creating an inverted index
    Stop words removal
    Applying Porter Stemming algorithm
    Normalized tokens for each document
  Merged normalized tokens for the three documents
    Sort the normalized tokens in alphabetical order
    Create frequencies for tokens per document
    Dictionary and related posting file
    Testing
  Boolean and vector queries
Question 2 IR evaluation
  Search engines
  Targets
  Search queries
    Google Search Engine
    Bing Search Engine
    Average for Google and Bing
Bibliography
Question 1
Creating an inverted index
The following documents are used to create the inverted index:
Document 1 (D1)
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
Document 2 (D2)
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections
Document 3 (D3)
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
To create the inverted index, the following steps are followed:
Remove the stop words
Apply the Porter stemming algorithm
Create normalized tokens for all documents
Merge the tokens into one list and arrange them in alphabetical order
Add the frequency of each token per document
Stop words removal
After removing the stop words, the documents become:
D1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
D2
Information retrieval finding material unstructured nature satisfies information within large
collections
D3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
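The stop word removal step can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for the assignment: the `STOP_WORDS` set and the `remove_stop_words` helper are assumptions, with the stop list chosen so that the output matches the results shown above (it treats "need" and "use" as stop words).

```python
import re

# Assumed stop list: chosen to reproduce the output shown above.
# It includes "need" and "use", which the results above also drop.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need", "of", "on",
    "or", "other", "that", "the", "to", "use",
}


def remove_stop_words(text):
    """Return the words of `text` that are not stop words, in order."""
    # Keep hyphenated words such as "full-text" as a single token.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]


D2 = ("Information retrieval is finding material of an unstructured nature "
      "that satisfies an information need from within large collections")

print(" ".join(remove_stop_words(D2)))
```

Matching is case-insensitive, while the original capitalization of the kept words is preserved.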
Applying Porter Stemming algorithm
After applying the Porter stemming algorithm, the documents become:
D1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
D2
Information retrieve find material unstructure nature satisfy information within large collect
D3
Information system study complement network hardware software people organ collect filter
process create distribute data
Normalized tokens for each document
D1
Token Document ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
D2
Token Document ID
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
D3
Token Document ID
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
Collect 3
Filter 3
Process 3
Create 3
Distribute 3
Data 3
Merged normalized tokens for the three documents
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Within 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Sort the normalized tokens in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
Collect 2
Collect 3
Complement 3
Content 1
Create 3
Data 3
Distribute 3
Filter 3
Find 2
Full 1
Hardware 3
Index 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
Organ 3
People 3
Process 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Within 2
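The merge and sort steps can be sketched in Python as below. This is an illustrative sketch only: the variable names are assumptions, and the token lists are copied in lowercase from the merged table, so "inform" appears as its own term for D1.

```python
# Normalized tokens per document, lowercased, copied from the merged table.
docs = {
    1: ["information", "retrieve", "active", "obtain", "inform", "resource",
        "relevant", "information", "resource", "search", "base", "full",
        "text", "content", "base", "index"],
    2: ["information", "retrieve", "find", "material", "unstructure",
        "nature", "satisfy", "information", "within", "large", "collect"],
    3: ["information", "system", "study", "complement", "network",
        "hardware", "software", "people", "organ", "collect", "filter",
        "process", "create", "distribute", "data"],
}

# Merge: one (term, doc_id) pair per token occurrence.
pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]

# Sort: tuples compare by term first, then by document ID.
sorted_pairs = sorted(pairs)

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```

Sorting on the whole tuple keeps the pairs for one term grouped together and ordered by document ID, which is the order the frequency counting step relies on.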
Create frequencies for tokens per document
Term Frequency Doc ID
Active 1 1
Base 2 1
Collect 1 2
Collect 1 3
Complement 1 3
Content 1 1
Create 1 3
Data 1 3
Distribute 1 3
Filter 1 3
Find 1 2
Full 1 1
Hardware 1 3
Index 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
Organ 1 3
People 1 3
Process 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
Within 1 2
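The per-document frequencies can be computed with a counter over (term, document ID) pairs. The sketch below is one possible way to do it, not the method used for the assignment. It assumes that "Inform" in D1 is counted as "information", which is how the table above reaches a frequency of 3 for Information in D1.

```python
from collections import Counter

# Normalized tokens per document, lowercased.
# Assumption: "inform" in D1 is written as "information" here, so that the
# count for D1 matches the frequency of 3 in the table above.
docs = {
    1: ["information", "retrieve", "active", "obtain", "information",
        "resource", "relevant", "information", "resource", "search", "base",
        "full", "text", "content", "base", "index"],
    2: ["information", "retrieve", "find", "material", "unstructure",
        "nature", "satisfy", "information", "within", "large", "collect"],
    3: ["information", "system", "study", "complement", "network",
        "hardware", "software", "people", "organ", "collect", "filter",
        "process", "create", "distribute", "data"],
}

# Count how often each term occurs in each document.
freq = Counter(
    (term, doc_id) for doc_id, tokens in docs.items() for term in tokens
)

# Print rows as: term, frequency, document ID (sorted by term, then doc ID).
for (term, doc_id), count in sorted(freq.items()):
    print(term, count, doc_id)
```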
Dictionary and related posting file
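A dictionary and its postings can be derived from the term frequencies computed in the previous step. The following Python sketch shows one possible in-memory layout. The names `dictionary` and `postings` are assumptions made for illustration, and "inform" in D1 is again counted as "information".

```python
from collections import Counter, defaultdict

# Normalized tokens per document, lowercased.
# Assumption: "inform" in D1 is counted as "information".
docs = {
    1: ["information", "retrieve", "active", "obtain", "information",
        "resource", "relevant", "information", "resource", "search", "base",
        "full", "text", "content", "base", "index"],
    2: ["information", "retrieve", "find", "material", "unstructure",
        "nature", "satisfy", "information", "within", "large", "collect"],
    3: ["information", "system", "study", "complement", "network",
        "hardware", "software", "people", "organ", "collect", "filter",
        "process", "create", "distribute", "data"],
}

# Postings: term -> list of (doc_id, term frequency), ordered by doc_id.
postings = defaultdict(list)
for doc_id in sorted(docs):
    for term, tf in Counter(docs[doc_id]).items():
        postings[term].append((doc_id, tf))

# Dictionary: term -> document frequency (number of documents with the term).
dictionary = {term: len(plist) for term, plist in postings.items()}

for term in sorted(dictionary):
    print(term, dictionary[term], postings[term])
```

In this layout the dictionary holds one entry per distinct term, and each entry points to a postings list that records where the term occurs and how often.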
Testing
Testing of the inverted index using information, system and index keywords was done using Google. The
results w
Boolean and vector queries
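a. Boolean queries
1) (information AND system AND index)
Result: no documents. Information occurs in D1, D2 and D3, but system occurs only in D3 and index only in D1, so no document contains all three terms. (Combining the three terms with OR returns D1, D2 and D3.)
2) (system AND index)
Result: no documents, because no document contains both terms. (system OR index returns D1 and D3.)
3) (system AND NOT index)
Result: D3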
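Boolean queries can be evaluated as set operations on the postings of each term. The sketch below is illustrative only. It assumes the postings implied by the frequency table: information occurs in D1, D2 and D3, system in D3, and index in D1.

```python
# Document IDs containing each query term, taken from the frequency table.
postings = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

# AND is set intersection.
q1 = postings["information"] & postings["system"] & postings["index"]
q2 = postings["system"] & postings["index"]

# AND NOT is set difference.
q3 = postings["system"] - postings["index"]

# OR is set union, shown for comparison.
q_or = postings["system"] | postings["index"]

print(sorted(q1), sorted(q2), sorted(q3), sorted(q_or))
```

With these postings the two AND queries match no document, the AND NOT query matches D3, and the OR query matches D1 and D3.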
b. Vector model using cosine similarity
Query Q = (information, system, index)
Each document and the query are written as term-frequency vectors over the query terms, in the order <information, system, index>. The cosine similarity is
\[ \sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert} \]
For D1
D1 = <3, 0, 1> (information occurs three times, system does not occur, index occurs once)
Q = <1, 1, 1>
\[ \sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2} \, \sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10} \, \sqrt{3}} \approx 0.73 \]
For D2
D2 = <2, 0, 0>
Q = <1, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2} \, \sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4} \, \sqrt{3}} \approx 0.58 \]
For D3
D3 = <1, 1, 0>
Q = <1, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2} \, \sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2} \, \sqrt{3}} \approx 0.82 \]
According to these scores, the order in which the documents appear in the search results, from highest to lowest similarity, is
D3, D1, D2
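The cosine similarity ranking can be checked with a short Python sketch. This is an illustration under stated assumptions: vectors hold raw term frequencies in the order (information, system, index), taken from the frequency table, and the `cosine` helper is a name introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Term frequencies in the order (information, system, index).
docs = {
    "D1": (3, 0, 1),
    "D2": (2, 0, 0),
    "D3": (1, 1, 0),
}
query = (1, 1, 1)

scores = {name: cosine(vec, query) for name, vec in docs.items()}

# Rank documents from highest to lowest similarity.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

A cosine similarity between vectors with non-negative components always lies between 0 and 1, which is a quick sanity check on any hand calculation.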
Question 2 IR evaluation
Search engines
Two of the top search engines are used to perform the IR evaluation. The search engines are:
Google
Bing
Targets
Target 2: obtain the price of the new Samsung tablet.
Target 3: obtain the manual for installing Tera Term.
Search queries
Query 1 = Samsung tablet price
Query 2 = Tera Term installation manual
Google Search Engine
Figure 1: Google Search Engine
Bing Search Engine
Figure 2: Bing Search Engine
Average for Google and Bing
Figure 3: Comparison by average
According to Figure 3, which shows the average precision and recall for Google and Bing, Google has a higher recall and a higher precision than Bing.
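For reference, precision and recall for a single query can be computed as in the Python sketch below. The result identifiers and relevance judgments are hypothetical placeholders, not the data behind Figure 3, and the function names are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant results that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average(values):
    """Arithmetic mean, used to average a measure over several queries."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical example: 4 results retrieved, 3 relevant results exist,
# and 2 of the retrieved results are relevant.
retrieved = ["r1", "r2", "r3", "r4"]
relevant = ["r1", "r3", "r9"]

p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Averaging the per-query precision and recall values for each engine gives the kind of figures that can then be compared side by side.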
Bibliography
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge
University Press.