SIT772 - Information Retrieval: Inverted Indexing and Evaluation
Added on 2023/06/11
AI Summary
This report details the process of creating an inverted index for information retrieval, including removing stop words and punctuation, applying the Porter stemming algorithm, creating a dictionary, and generating posting files. It also covers testing the inverted index using boolean and vector queries. The report further evaluates information retrieval performance by comparing Google and Ask.com search engines based on target and designed queries, assessing recall and precision to determine the superior search engine. The analysis concludes that Google outperforms Ask.com due to its higher recall and precision scores across the tested queries.

COVER PAGE

Contents
COVER PAGE
DETAILS
Question 1
a. Remove all stop words and punctuation
b. Applying Porter Stemming algorithm
c. Creating dictionary
Merge all documents into one dictionary file
Sort the merged in alphabetical order
Merge terms appearing more than once
Create posting file
d. Testing
Boolean and vector queries
Question 2 IR evaluation
a. Target and designed queries
Search engines
Target
Target 1: obtain the course information for S779
Search queries
b. target, results and designed search queries
A. Google Search engine
B. Ask.com
C. Average comparison
Bibliography

Question 1
Creating inverted index.
a. Remove all stop words and punctuation
DOC 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
DOC 2
Information retrieval finding material unstructured nature satisfies information large collections
DOC 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
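The filtering in step (a) can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the sample sentence and the function name remove_stop_words are assumptions made for the example, not the list or input text used in the report. Hyphens are kept so that compounds such as "full-text" survive, as they do in DOC 1.

```python
import string

# Hypothetical stop-word list; the report does not say which list was used.
STOP_WORDS = {"a", "an", "and", "be", "can", "from", "in", "is",
              "of", "on", "or", "other", "that", "the", "to"}

def remove_stop_words(text):
    """Strip punctuation (except hyphens) and drop stop words."""
    drop = string.punctuation.replace("-", "")
    cleaned = text.translate(str.maketrans("", "", drop))
    return [word for word in cleaned.split() if word.lower() not in STOP_WORDS]

# Assumed input sentence, chosen to resemble the end of DOC 1.
sample = "Searches can be based on full-text or other content-based indexing."
print(" ".join(remove_stop_words(sample)))
```

With this stop-word list the sample reduces to the same five words that close DOC 1 above.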
b. Applying Porter Stemming algorithm
The next step is to apply the Porter stemming algorithm. After stemming, the documents become:
DOC1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
DOC2
Information retrieve find material unstructure nature satisfy information large collect
DOC3
Information system study complement network hardware software people organ collect filter
process create distribute data
c. Creating dictionary
Merge all documents into one dictionary file
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Sort the merged in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
collect 2
Complement 3
Content 1
Find 2
Full 1
Hardware 3
index 1
Information 1
Information 1
Information 1
Information 2
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
People 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Merge terms appearing more than once
Term Frequency Doc ID
Active 1 1
Base 2 1
collect 1 2
Complement 1 3
Content 1 1
Find 1 2
Full 1 1
Hardware 1 3
index 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
People 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
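The three dictionary steps (merge, sort, collapse duplicates) can be sketched as follows. This is a minimal illustration that takes the stemmed documents from part (b), lowercased, as input; the variable names docs, pairs and freq are chosen for the example and are not from the report.

```python
from collections import Counter

# Stemmed documents from part (b), lowercased; keys are the doc IDs.
docs = {
    1: "information retrieve active obtain inform resource relevant information "
       "collect information resource search base full text content base index",
    2: "information retrieve find material unstructure nature satisfy "
       "information large collect",
    3: "information system study complement network hardware software people "
       "organ collect filter process create distribute data",
}

# Step 1: merge all documents into one list of (term, doc_id) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort alphabetically by term, then by doc ID.
pairs.sort()

# Step 3: collapse repeated pairs into a frequency per (term, doc_id).
freq = Counter(pairs)

for (term, doc_id), count in sorted(freq.items()):
    print(f"{term:<12}{count:<4}{doc_id}")
```

Sorting before counting is not strictly required by Counter, but it mirrors the order of the steps in this section.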
Create posting file
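A posting file maps each dictionary term to the list of documents that contain it. The sketch below shows one possible way to derive such lists from (term, frequency, doc ID) rows like those in the merged table above. Only a subset of the rows is used, and the function name build_postings and the (doc ID, frequency) layout of each posting are assumptions made for illustration.

```python
from collections import defaultdict

# (term, frequency, doc_id) rows: a subset of the merged dictionary table.
rows = [
    ("base", 2, 1),
    ("index", 1, 1),
    ("information", 3, 1),
    ("information", 2, 2),
    ("information", 1, 3),
    ("retrieve", 1, 1),
    ("retrieve", 1, 2),
    ("system", 1, 3),
]

def build_postings(rows):
    """Group rows by term into lists of (doc_id, frequency), ordered by doc ID."""
    postings = defaultdict(list)
    for term, tf, doc_id in sorted(rows, key=lambda r: (r[0], r[2])):
        postings[term].append((doc_id, tf))
    return dict(postings)

postings = build_postings(rows)
for term, plist in postings.items():
    # Document frequency is simply the length of the posting list.
    print(term, len(plist), plist)
```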
d. Testing
Searching the inverted index with the keywords information, system and index showed that it returns
the documents related to the search query.
Boolean and vector queries
e. Boolean queries
i) (information AND system AND index)
Result: none (no document contains all three terms)
ii) (System AND index)
Result: none (system occurs only in D3, index only in D1)
iii) (System OR Index)
Result: D1, D3
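Boolean queries reduce to set operations on posting lists. The sketch below is a minimal illustration using the postings of the three query terms as recorded in the dictionary of part (c): information occurs in Docs 1, 2 and 3, system only in Doc 3, and index only in Doc 1. The helper names boolean_and and boolean_or are hypothetical.

```python
# Posting sets for the query terms, read from the dictionary in part (c).
index = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}

def boolean_and(*terms):
    """Documents containing every term: intersection of the posting sets."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Documents containing at least one term: union of the posting sets."""
    return set().union(*(index.get(term, set()) for term in terms))

print(boolean_and("information", "system", "index"))  # empty set
print(boolean_and("system", "index"))                 # empty set
print(boolean_or("system", "index"))                  # Docs 1 and 3
```

An AND query can only return a document that appears in every posting list, so any query joining system and index with AND is empty for this collection.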
f. Vector model using cosine similarity
Query Q = (information, system, index), so Q = <1, 1, 1>. Each document vector holds the frequencies
of information, system and index in that document, taken from the dictionary in part (c).
Doc 1
D1 = <information, information, information, index> = <3, 0, 1>
$$\sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73$$
Doc 2
D2 = <information, information> = <2, 0, 0>
$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$
Doc 3
D3 = <information, system> = <1, 1, 0>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$
The order of retrieval by a search engine is Doc 3, Doc 1 then Doc 2.
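The cosine calculation can be checked with a short script. This is a sketch under the assumption that each document vector holds the raw term frequencies of information, system and index from the dictionary in part (c): information occurs 3, 2 and 1 times in Docs 1 to 3, system occurs once in Doc 3, and index occurs once in Doc 1. The names doc_vectors, cosine and ranking are chosen for the example.

```python
import math

# Vector components are ordered as (information, system, index).
doc_vectors = {
    1: [3, 0, 1],
    2: [2, 0, 0],
    3: [1, 1, 0],
}
query = [1, 1, 1]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

scores = {doc_id: cosine(vec, query) for doc_id, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(f"Doc {doc_id}: {scores[doc_id]:.2f}")
```

With non-negative term frequencies a cosine score always lies between 0 and 1, which is a quick sanity check on any hand calculation.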

Question 2 IR evaluation
a. Target and designed queries
Search engines
Google
Ask.com
Target
Target 1: obtain the course information for S779
Search queries
Query 1=S779 course information
Query 2= s779 course manual
b. target, results and designed search queries
A. Google Search engine
Figure 1: Google Search Engine
B. Ask.com
Figure 2: Ask.com search engine
C. Average comparison
Figure 3: average comparison for both search engines
The line graph shows that Google is superior to Ask.com because it has higher recall and precision,
averaged over the search results of both queries. Recall is the fraction of all relevant results that
the search engine returns; Google has higher recall for both queries. Precision is the fraction of the
returned results that are relevant. Google also has higher precision than Ask.com because more of the
results it returns are related to both queries. Thus, in general, Google is more powerful than Ask.com.
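The two measures can be computed as below. The counts in the example are hypothetical and are not the figures measured for Google or Ask.com in this report; they only show how the ratios are formed.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the returned results that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant results that were returned."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0

# Hypothetical example: 10 results returned, 6 of them relevant,
# and 8 relevant pages assumed to exist for the query.
p = precision(6, 10)
r = recall(6, 8)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Averaging each measure over Query 1 and Query 2 gives one precision value and one recall value per search engine, which is the kind of comparison Figure 3 plots.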
Bibliography
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge
University Press.