Inverted Index and IR Evaluation for Desklib Online Library
Added on 2023/06/12
This article walks through the creation of an inverted index for three short documents, including stop-word removal and the Porter stemming algorithm, followed by Boolean and vector queries. It also covers IR evaluation using the Google and Ask.com search engines, with the target and search queries provided. The article includes step-by-step instructions and the relevant figures.
COVER PAGE
DETAILS
Contents
COVER PAGE
DETAILS
Question 1
Inverted index
Document for each topic
a. Stop words removal
Applying Porter Stemming algorithm
Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Step 2: Normalized tokens sorted in alphabetical order
b. Step 3: Merge terms appearing more than once
c. dictionary and related posting file
d. Testing
Boolean and vector queries
Question 2 IR evaluation
a. Target and designed queries
Search engines
Target
Search queries
b. List your target, results and designed search queries
A. Google Search engine
B. Ask.com
C. Average comparison
Question 1
Inverted index
Document for each topic
DOC1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
DOC2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
DOC3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The following steps were followed to create the inverted index:
a. Stop words removal
After removing the stop words, the documents become:
DOC1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
DOC2
Information retrieval finding material unstructured nature satisfies information large collections
DOC3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
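The stop-word step above can be sketched in a few lines of Python. This is only an illustration: the `STOP_WORDS` set is an assumption inferred from the words that disappear between the original and filtered documents, and the names `remove_stop_words` and `DOC2` are hypothetical.

```python
import re

# Assumed stop list: exactly the words dropped from DOC1-DOC3 in the text above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need", "of", "on",
    "or", "other", "that", "the", "to", "use", "within",
}

def remove_stop_words(text):
    """Return the tokens of `text` that are not stop words.

    Hyphenated words such as "full-text" are kept as single tokens,
    and punctuation is discarded.
    """
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

DOC2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")

print(" ".join(remove_stop_words(DOC2)))
```

Matching is case-insensitive, but the surviving tokens keep their original spelling so that the output can be compared directly with the filtered documents listed above.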
Applying Porter Stemming algorithm
The next step applies the Porter stemming algorithm; the resulting documents are:
DOC1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
DOC2
Information retrieve find material unstructure nature satisfy information large collect
DOC3
Information system study complement network hardware software people organ collect filter
process create distribute data
Steps followed to create the inverted index
Step 1: create normalized tokens from every document
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Collect 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Step 2: Normalized tokens sorted in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
Collect 1
Collect 2
Collect 3
Complement 3
Content 1
Create 3
Data 3
Distribute 3
Filter 3
Find 2
Full 1
Hardware 3
Index 1
Inform 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
Organ 3
People 3
Process 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
b. Step 3: Merge terms appearing more than once
Term Frequency Doc ID
Active 1 1
Base 2 1
Collect 1 1
Collect 1 2
Collect 1 3
Complement 1 3
Content 1 1
Create 1 3
Data 1 3
Distribute 1 3
Filter 1 3
Find 1 2
Full 1 1
Hardware 1 3
Index 1 1
Inform 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
Organ 1 3
People 1 3
Process 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
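Steps 1 to 3 can be sketched as a short Python routine that computes the term, frequency and document ID triples directly from the stemmed documents. This is an illustrative sketch, not the procedure used to produce the tables: the names `DOCS` and `build_index` are hypothetical, and the stems are copied from the text rather than generated by a Porter stemmer implementation.

```python
from collections import Counter, defaultdict

# Stemmed documents as listed in the text, lower-cased (stems taken as given).
DOCS = {
    1: ("information retrieve active obtain inform resource relevant information "
        "collect information resource search base full text content base index"),
    2: ("information retrieve find material unstructure nature satisfy "
        "information large collect"),
    3: ("information system study complement network hardware software people "
        "organ collect filter process create distribute data"),
}

def build_index(docs):
    """Build a mapping term -> {doc_id: term_frequency}."""
    # Step 1: one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
    # Step 2: sort alphabetically by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs into a frequency count.
    freqs = Counter(pairs)
    # Dictionary and postings: each term points to its documents and frequencies.
    index = defaultdict(dict)
    for (term, doc_id), tf in sorted(freqs.items()):
        index[term][doc_id] = tf
    return dict(index)

index = build_index(DOCS)
for term in ("information", "base", "collect"):
    print(term, index[term])
```

Keeping the frequency inside each posting means the same structure can serve both Boolean queries (which only need the document IDs) and frequency-based ranking.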
c. dictionary and related posting file
d. Testing
The inverted index was tested using Google; the results returned by the search engine were related to the query terms.
Boolean and vector queries
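e. Boolean queries
Postings used (from the merged table): information appears in D1, D2 and D3; system appears only in D3; index appears only in D1.
i) (information AND system AND index)
Result: no documents, since no single document contains all three terms
ii) (system AND index)
Result: no documents, since system occurs only in D3 and index only in D1
iii) (system AND NOT index)
Result: D3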
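Boolean queries of this kind reduce to set operations on postings lists. The sketch below assumes the postings for the three query terms follow the merged table (information in documents 1 to 3, system in document 3, index in document 1); the helper names `boolean_and`, `boolean_or` and `boolean_not` are hypothetical.

```python
# Postings for the query terms, assumed from the merged term/frequency table.
POSTINGS = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}
ALL_DOCS = {1, 2, 3}

def boolean_and(*terms):
    """Documents containing every one of the given terms (set intersection)."""
    result = set(ALL_DOCS)
    for term in terms:
        result &= POSTINGS.get(term, set())
    return result

def boolean_or(*terms):
    """Documents containing at least one of the given terms (set union)."""
    result = set()
    for term in terms:
        result |= POSTINGS.get(term, set())
    return result

def boolean_not(term):
    """Documents that do not contain the given term (set complement)."""
    return ALL_DOCS - POSTINGS.get(term, set())

print(sorted(boolean_and("information", "system", "index")))
print(sorted(boolean_and("system", "index")))
print(sorted(boolean_and("system") & boolean_not("index")))
print(sorted(boolean_or("system", "index")))
```

AND is an intersection, so a query succeeds only for documents that appear in every postings list involved; OR is the union and is therefore never smaller than the corresponding AND result.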
f. Vector model using cosine similarity
Query Q = (information, system, index)
Each document is represented by its raw term frequencies for the query terms, in the order <information, system, index>. The cosine similarity between a document vector D and the query vector Q is
\[ \sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert} \]
Document 1
D1 = <information, information, information, index> = <3, 0, 1>
Q = <information, system, index> = <1, 1, 1>
\[ \sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73 \]
Document 2
D2 = <information, information> = <2, 0, 0>
Q = <information, system, index> = <1, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58 \]
Document 3
D3 = <information, system> = <1, 1, 0>
Q = <information, system, index> = <1, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 \]
When the query is searched, the documents will appear in the following order:
1. Document 3
2. Document 1
3. Document 2
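The cosine computation can be checked with a short Python sketch. It assumes raw term-frequency vectors in the order (information, system, index), taken from the merged table: document 1 contains information three times and index once, document 2 contains information twice, and document 3 contains information and system once each. The names `cosine`, `DOC_VECTORS` and `ranking` are hypothetical.

```python
import math

def cosine(d, q):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Vector components are ordered as (information, system, index).
Q = (1, 1, 1)
DOC_VECTORS = {
    "D1": (3, 0, 1),  # information x3, index x1
    "D2": (2, 0, 0),  # information x2
    "D3": (1, 1, 0),  # information x1, system x1
}

scores = {name: cosine(vec, Q) for name, vec in DOC_VECTORS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

For vectors with non-negative components the cosine always lies between 0 and 1, which gives a quick sanity check on any hand calculation: a score above 1 indicates an arithmetic slip in the norms.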
Question 2 IR evaluation
a. Target and designed queries
Search engines
Google
Ask.com
Target
Target 4: obtain the Oracle SQL tutorial.
Search queries
Query 1 = Oracle SQL manual
Query 2 = Oracle SQL tutorial
b. List your target, results and designed search queries
A. Google Search engine
Figure 1: Google search engine (key: green = precision, white = recall)
B. Ask.com
Figure 2: Ask.com search engine (key: green = precision, white = recall)
C. Average comparison
Figure 3: Average comparison (key: green = precision, white = recall)
Based on the graph shown in Figure 3, Google outperforms Ask.com: across the two queries, Google demonstrates higher precision and recall than Ask.com, so the user is able to get more results related to what he or she is searching for.
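The precision and recall values compared here follow the standard definitions: precision is the fraction of retrieved results that are relevant, and recall is the fraction of all relevant items that were retrieved. The sketch below shows the arithmetic only; the relevance judgments and the total of relevant items are hypothetical placeholders, not the values measured for Google or Ask.com, and the function name `precision_recall` is an assumption.

```python
def precision_recall(judgments, total_relevant):
    """Compute precision and recall for one result list.

    judgments      -- list of booleans, True if the result at that rank is relevant
    total_relevant -- number of relevant items that exist for the target
    """
    hits = sum(judgments)
    precision = hits / len(judgments) if judgments else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical judgments for the top five results of one query (illustration only).
example_judgments = [True, True, False, True, False]
p, r = precision_recall(example_judgments, total_relevant=6)
print(f"precision={p:.2f} recall={r:.2f}")
```

For a web search engine the total number of relevant pages is not known exactly, so recall is usually estimated against a fixed pool of judged results rather than the whole web.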