Creating an Inverted Index and IR Evaluation for Desklib
Added on 2023/06/12
This article explains how to create an inverted index for Desklib, including removing stop words and using the Porter Stemming algorithm. It also covers sorting words in alphabetical order, word frequency per document, and testing with Boolean and vector queries. Additionally, it evaluates IR for Desklib using Google and Bing search engines, targets, and search queries.
Contents
Question 1
Creating an inverted index
Remove stop words
Porter Stemming algorithm
The three documents merged
Sorting words in alphabetical order for all documents
Word frequency per document
Posting file
Testing
Boolean and vector queries
Question 2 IR evaluation
Search engines
Targets
Designed Search queries
Google search Engine
Bing Search Engine
Average for Google and Bing
Bibliography
Question 1
Creating an inverted index
Document 1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
Document 2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
Document 3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
Remove stop words
Results
Document 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
Document 2
Information retrieval finding material unstructured nature satisfies information within large
collections
Document 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
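The stop-word step can be sketched in Python as follows. The stop list here is an assumption: the original does not name the list it used, so this one is reverse-engineered from the results above (which also drop "need", "use", "can" and "other"). The function name `remove_stop_words` is illustrative.

```python
import re

# Hypothetical stop list, chosen so that the output matches the results above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need", "of", "on",
    "or", "other", "that", "the", "to", "use",
}

def remove_stop_words(text):
    """Return the words of `text`, in order, with stop words removed."""
    # Hyphenated words such as "full-text" are kept as single tokens.
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")
print(" ".join(remove_stop_words(doc2)))
```

A different stop list (for example one that keeps "need") would give a different token sequence, so the list should be fixed before the index is built.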
Porter Stemming algorithm
Results
Document 1
inform retriev activ obtain inform resourc relev inform collect inform resourc search base full
text content base index
Document 2
inform retriev find materi unstructur natur satisfi inform within larg collect
Document 3
inform system studi complementari network hardwar softwar peopl organ collect filter
process creat distribut data
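For reference, the stemming step can be reproduced with a compact Python sketch of Porter's suffix-stripping rules. This is not the tool used for the results above (the original does not say which one was used). The helper names are illustrative, and the sketch lowercases its input, so every occurrence of "information" stems to "inform".

```python
VOWELS = "aeiou"

def _is_cons(w, i):
    """True if w[i] is a consonant; 'y' counts as a vowel after a consonant."""
    if w[i] in VOWELS:
        return False
    if w[i] == "y":
        return i == 0 or not _is_cons(w, i - 1)
    return True

def _shape(w):
    return "".join("c" if _is_cons(w, i) else "v" for i in range(len(w)))

def _m(stem):
    """Porter's measure: the number of vowel-to-consonant transitions."""
    return _shape(stem).count("vc")

def _has_vowel(stem):
    return "v" in _shape(stem)

def _double_cons(w):
    return len(w) >= 2 and w[-1] == w[-2] and _is_cons(w, len(w) - 1)

def _cvc(w):
    return len(w) >= 3 and _shape(w)[-3:] == "cvc" and w[-1] not in "wxy"

# Longer suffixes come first where one suffix ends with another.
STEP2 = [("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
         ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
         ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
         ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
         ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble")]
STEP3 = [("icate", "ic"), ("ative", ""), ("alize", "al"), ("iciti", "ic"),
         ("ical", "ic"), ("ful", ""), ("ness", "")]
STEP4 = ["ement", "ment", "ent", "ance", "ence", "able", "ible", "ant", "ism",
         "ate", "iti", "ous", "ive", "ize", "ion", "al", "er", "ic", "ou"]

def _apply(w, rules, min_m):
    """Rewrite the first matching suffix if the remaining stem has measure > min_m."""
    for suffix, repl in rules:
        if w.endswith(suffix):
            stem = w[: -len(suffix)]
            # "ion" is only removed after "s" or "t".
            if _m(stem) > min_m and (suffix != "ion" or stem[-1:] in ("s", "t")):
                return stem + repl
            return w
    return w

def porter_stem(word):
    w = word.lower()
    if len(w) <= 2:
        return w
    # Step 1a: plurals.
    if w.endswith("sses") or w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Step 1b: "eed", "ed", "ing".
    if w.endswith("eed"):
        if _m(w[:-3]) > 0:
            w = w[:-1]
    else:
        for suffix in ("ed", "ing"):
            if w.endswith(suffix) and _has_vowel(w[: -len(suffix)]):
                w = w[: -len(suffix)]
                if w.endswith(("at", "bl", "iz")):
                    w += "e"
                elif _double_cons(w) and w[-1] not in "lsz":
                    w = w[:-1]
                elif _m(w) == 1 and _cvc(w):
                    w += "e"
                break
    # Step 1c: terminal "y" becomes "i" when the stem contains a vowel.
    if w.endswith("y") and _has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Steps 2 to 4: derivational suffixes.
    w = _apply(w, STEP2, 0)
    w = _apply(w, STEP3, 0)
    w = _apply(w, [(s, "") for s in STEP4], 1)
    # Step 5: final "e" and double "l".
    if w.endswith("e"):
        m = _m(w[:-1])
        if m > 1 or (m == 1 and not _cvc(w[:-1])):
            w = w[:-1]
    if w.endswith("ll") and _m(w) > 1:
        w = w[:-1]
    return w

words = "Information retrieval activity obtaining resources relevant Searches".split()
print([porter_stem(w) for w in words])
```

Hyphenated tokens such as "full-text" are assumed to be split into "full" and "text" before stemming, as in the results above.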
The three documents merged
Term Document
inform 1
retriev 1
activ 1
obtain 1
inform 1
resourc 1
relev 1
inform 1
collect 1
inform 1
resourc 1
search 1
base 1
full 1
text 1
content 1
base 1
index 1
inform 2
retriev 2
find 2
materi 2
unstructur 2
natur 2
satisfi 2
inform 2
within 2
larg 2
collect 2
inform 3
system 3
studi 3
complementari 3
network 3
hardwar 3
softwar 3
peopl 3
organ 3
collect 3
filter 3
process 3
creat 3
distribut 3
data 3
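A merged term/document list of this kind can be generated by walking each stemmed document in order and emitting one (term, document) pair per token. The sketch below assumes lowercase stems with "information" reduced to "inform"; the names `stemmed_docs` and `merge_documents` are illustrative.

```python
# Stemmed tokens per document (assumption: lowercase, "information" -> "inform").
stemmed_docs = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index".split(),
    2: "inform retriev find materi unstructur natur satisfi inform "
       "within larg collect".split(),
    3: "inform system studi complementari network hardwar softwar peopl "
       "organ collect filter process creat distribut data".split(),
}

def merge_documents(docs):
    """Return (term, doc_id) pairs in document order, one pair per token."""
    return [(term, doc_id) for doc_id in sorted(docs) for term in docs[doc_id]]

pairs = merge_documents(stemmed_docs)
print(len(pairs))   # 18 + 11 + 15 tokens
print(pairs[:3])
```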
Sorting words in alphabetical order for all documents
Term Document
activ 1
base 1
base 1
collect 1
collect 2
collect 3
complementari 3
content 1
creat 3
data 3
distribut 3
filter 3
find 2
full 1
hardwar 1
index 1
inform 1
inform 1
inform 1
inform 1
inform 2
inform 2
inform 3
larg 2
materi 2
natur 2
network 3
obtain 1
organ 3
peopl 3
process 3
relev 1
resourc 1
resourc 1
retriev 1
retriev 2
satisfi 2
search 1
softwar 3
studi 3
system 3
text 1
unstructur 2
within 2
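The sort itself is a single call in Python. The sketch below orders pairs by term and then by document number. The case-insensitive key is an assumption made so that capitalised and lowercase stems interleave alphabetically, and `sort_pairs` is an illustrative name.

```python
def sort_pairs(pairs):
    """Sort (term, doc_id) pairs by term, ignoring case, then by document number."""
    return sorted(pairs, key=lambda p: (p[0].lower(), p[1]))

sample = [("retriev", 2), ("collect", 3), ("Base", 1), ("collect", 2), ("activ", 1)]
print(sort_pairs(sample))
# [('activ', 1), ('Base', 1), ('collect', 2), ('collect', 3), ('retriev', 2)]
```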
Word frequency per document
Term Frequency Document
activ 1 1
base 2 1
collect 1 1
collect 1 2
collect 1 3
complementari 1 3
content 1 1
creat 1 3
data 1 3
distribut 1 3
filter 1 3
find 1 2
full 1 1
hardwar 1 3
index 1 1
inform 4 1
inform 2 2
inform 1 3
larg 1 2
materi 1 2
natur 1 2
network 1 3
obtain 1 1
organ 1 3
peopl 1 3
process 1 3
relev 1 1
resourc 2 1
retriev 1 1
retriev 1 2
satisfi 1 2
search 1 1
softwar 1 3
studi 1 3
system 1 3
text 1 1
unstructur 1 2
within 1 2
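Collapsing repeated (term, document) pairs into frequencies can be sketched with `collections.Counter`. The function name and the small sample input are illustrative, not taken from the original.

```python
from collections import Counter

def term_frequencies(pairs):
    """Collapse duplicate (term, doc_id) pairs into (term, frequency, doc_id) rows."""
    counts = Counter(pairs)
    return [(term, counts[(term, doc)], doc) for term, doc in sorted(counts)]

sample_pairs = [("inform", 1), ("base", 1), ("inform", 1), ("inform", 2), ("base", 1)]
for row in term_frequencies(sample_pairs):
    print(*row)
# base 2 1
# inform 2 1
# inform 1 2
```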
Posting file
Testing
To test the index, two search engines are queried with the most frequent terms in the posting file
created above. The results of both search engines are then compared to determine whether the
returned documents are related to the original documents from which the posting file was built.
Boolean and vector queries
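a. Boolean queries
1) information AND system AND index = no document, because no single document contains all three terms. With OR, the query returns Doc1, Doc2, Doc3.
2) system AND index = no document, because "system" occurs only in Doc3 and "index" only in Doc1. With OR, the query returns Doc1, Doc3.
3) information AND index = Doc1. With OR, the query returns Doc1, Doc2, Doc3.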
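Boolean queries map directly onto set operations over the inverted index: AND is an intersection of the document sets and OR is a union. The sketch below prints both so the results can be checked against the documents. It assumes the query terms are stemmed the same way as the index ("information" becomes "inform"), and the function names are illustrative.

```python
# Stemmed terms per document (assumption: lowercase, "information" -> "inform").
docs = {
    1: set("inform retriev activ obtain resourc relev collect search base "
           "full text content index".split()),
    2: set("inform retriev find materi unstructur natur satisfi within "
           "larg collect".split()),
    3: set("inform system studi complementari network hardwar softwar peopl "
           "organ collect filter process creat distribut data".split()),
}

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))

index = build_index(docs)
for query in (("inform", "system", "index"), ("system", "index"), ("inform", "index")):
    print(query, "AND:", sorted(boolean_and(index, *query)),
          "OR:", sorted(boolean_or(index, *query)))
```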
b. Vector queries
The cosine similarity is computed for the query Q = (information, system, index). Each document
vector holds the number of times the three query terms occur in that document after stop-word
removal, in the order (information, system, index).
Q = <information, system, index> = <1, 1, 1>
Document 1
D1 = <4, 0, 1>
$$\sigma(D_1, Q) = \frac{4 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{4^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{5}{\sqrt{17}\,\sqrt{3}} \approx 0.70$$
Document 2
D2 = <2, 0, 0>
$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$
Document 3
D3 = <1, 1, 0>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$
Based on the cosine similarity, a search engine would return the documents in the following
order:
1) Document 3
2) Document 1
3) Document 2
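The cosine computation and the resulting ranking can be checked with a few lines of Python. The vectors below are the raw counts of (information, system, index) in each document after stop-word removal. Using raw counts rather than tf-idf weights is an assumption that follows the worked example, and the function name is illustrative. For non-negative vectors the cosine always lies between 0 and 1.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

query = (1, 1, 1)  # information, system, index
doc_vectors = {1: (4, 0, 1), 2: (2, 0, 0), 3: (1, 1, 0)}

scores = {doc: cosine_similarity(vec, query) for doc, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc in ranking:
    print(f"Document {doc}: {scores[doc]:.2f}")
```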
Question 2 IR evaluation
Search engines
Google
Bing
Targets
Target 1: Obtain the course information for S779.
Target 2: Obtain the price of a new Samsung tablet.
Designed Search queries
Query 1= S779 course information
Query 2= new Samsung tablet price
Google search Engine
Figure 1: Google Search Engine
Bing Search Engine
Figure 2: Bing search engine
Average for Google and Bing
Figure 3: Comparison by average
The graph in Figure 3 shows that the Google search engine performs better than the Bing search
engine: averaged over the two search queries, Google has both higher precision and higher recall.
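Precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant results that are returned. The sketch below shows how the per-query values and their average could be computed. The result identifiers and relevance judgments are hypothetical placeholders, not the measurements behind Figure 3.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments for two queries on one engine (illustration only).
runs = [
    (["r1", "r2", "r3", "r4", "r5"], {"r1", "r3", "r6", "r7"}),  # query 1
    (["s1", "s2", "s3", "s4"], {"s1", "s2", "s4"}),              # query 2
]

results = [precision_recall(retrieved, relevant) for retrieved, relevant in runs]
avg_precision = sum(p for p, _ in results) / len(results)
avg_recall = sum(r for _, r in results) / len(results)
print(results)
print(avg_precision, avg_recall)
```

Running the same computation for each engine and comparing the two averages gives the kind of comparison plotted in Figure 3.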
Bibliography
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 27 May 2018].