SIT772 - Information Retrieval: Inverted Indexing and Evaluation

Added on  2023/06/11

AI Summary
This report details the process of creating an inverted index for information retrieval, including removing stop words and punctuation, applying the Porter stemming algorithm, creating a dictionary, and generating posting files. It also covers testing the inverted index using boolean and vector queries. The report further evaluates information retrieval performance by comparing Google and Ask.com search engines based on target and designed queries, assessing recall and precision to determine the superior search engine. The analysis concludes that Google outperforms Ask.com due to its higher recall and precision scores across the tested queries.
Document Page
COVER PAGE
Contents
COVER PAGE
DETAILS
Question 1
a. Remove all stop words and punctuation
b. Applying Porter Stemming algorithm
c. Creating dictionary
Merge all documents into one dictionary file
Sort the merged in alphabetical order
Merge terms appearing more than once
Create posting file
d. Testing
Boolean and vector queries
Question 2 IR evaluation
a. Target and designed queries
Search engines
Target
Target 1: obtain the course information for S779
Search queries
b. target, results and designed search queries
A. Google Search engine
B. Ask.com
C. Average comparison
Bibliography
Question 1
Creating inverted index.
a. Remove all stop words and punctuation
DOC 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
DOC 2
Information retrieval finding material unstructured nature satisfies information large collections
DOC 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
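The filtering in step a can be sketched in Python as follows. This is only an illustration: the stop-word list `STOP_WORDS`, the function name `remove_stop_words` and the sample sentence are assumptions, since the report does not state which stop-word list or input text was used. Note that this sketch also splits hyphenated words, which the documents above keep intact until the stemming step.

```python
import string

# Illustrative stop-word list (an assumption, not the list used in the report).
STOP_WORDS = {"is", "the", "of", "a", "an", "and", "from", "in", "on", "to", "that", "for"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Strip punctuation, then drop every token found in the stop-word list."""
    # Replace each punctuation character with a space so words stay separated.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = text.translate(table).split()
    # Compare in lower case so "The" and "the" are both removed.
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical input sentence.
sample = "Information retrieval is the activity of obtaining information resources."
filtered = remove_stop_words(sample)
```

Here `filtered` holds the tokens Information, retrieval, activity, obtaining, information and resources, in that order.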
b. Applying Porter Stemming algorithm
The next step applies the Porter stemming algorithm to each document. The stemmed documents become:
DOC1
Information retrieve active obtain inform resource relevant information collect information
resource Search base full text content base index
DOC2
Information retrieve find material unstructure nature satisfy information large collect
DOC3
Information system study complement network hardware software people organ collect filter
process create distribute data
c. Creating dictionary
Merge all documents into one dictionary file
Term Doc ID
Information 1
Retrieve 1
Active 1
Obtain 1
Inform 1
Resource 1
Relevant 1
Information 1
Resource 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Information 2
Retrieve 2
Find 2
Material 2
Unstructure 2
Nature 2
Satisfy 2
Information 2
Large 2
collect 2
Information 3
System 3
Study 3
Complement 3
Network 3
Hardware 3
Software 3
People 3
Organ 3
collect 3
Filter 3
Process 3
create 3
Distribute 3
data 3
Sort the merged in alphabetical order
Term Doc ID
Active 1
Base 1
Base 1
collect 2
Complement 3
Content 1
Find 2
Full 1
Hardware 3
index 1
Information 1
Information 2
Information 1
Information 1
Information 2
Information 3
Large 2
Material 2
Nature 2
Network 3
Obtain 1
People 3
Relevant 1
Resource 1
Resource 1
Retrieve 1
Retrieve 2
Satisfy 2
Search 1
Software 3
Study 3
System 3
Text 1
Unstructure 2
Merge terms appearing more than once
Term Frequency Doc ID
Active 1 1
Base 2 1
collect 1 2
Complement 1 3
Content 1 1
Find 1 2
Full 1 1
Hardware 1 3
index 1 1
Information 3 1
Information 2 2
Information 1 3
Large 1 2
Material 1 2
Nature 1 2
Network 1 3
Obtain 1 1
People 1 3
Relevant 1 1
Resource 2 1
Retrieve 1 1
Retrieve 1 2
Satisfy 1 2
Search 1 1
Software 1 3
Study 1 3
System 1 3
Text 1 1
Unstructure 1 2
Create posting file
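A posting file lists, for each dictionary term, the documents that contain it and the term's frequency in each. A minimal Python sketch, assuming the stemmed documents from part b as input (the names `DOCS` and `build_postings` are illustrative and do not come from the report):

```python
from collections import Counter, defaultdict

# Stemmed documents copied from part b, keyed by document ID.
DOCS = {
    1: "Information retrieve active obtain inform resource relevant information "
       "collect information resource Search base full text content base index",
    2: "Information retrieve find material unstructure nature satisfy information "
       "large collect",
    3: "Information system study complement network hardware software people organ "
       "collect filter process create distribute data",
}


def build_postings(docs):
    """Map each term to a list of (doc_id, term_frequency) pairs."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Lower-case so "Information" and "information" count as one term.
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    # Sort terms alphabetically, as in the dictionary.
    return dict(sorted(postings.items()))


postings = build_postings(DOCS)
```

With this input, `postings["information"]` is `[(1, 3), (2, 2), (3, 1)]`, matching the three Information rows of the merged dictionary.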
d. Testing
Searching the inverted index with the keywords information, system and index showed that the index
returns documents related to the search query.
Boolean and vector queries
e. Boolean queries
i) (information AND system AND index)
Result: no documents, because system occurs only in Doc 3 and index only in Doc 1
ii) (system AND index)
Result: no documents
iii) (system OR index)
Result: D1, D3
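Boolean queries of this kind reduce to set operations on the posting lists. A minimal Python sketch, assuming the document IDs recorded in the dictionary of part c (information in Docs 1, 2 and 3; system in Doc 3; index in Doc 1); the names `POSTINGS`, `boolean_and` and `boolean_or` are illustrative:

```python
# Document IDs per term, taken from the dictionary in part c.
POSTINGS = {
    "information": {1, 2, 3},
    "system": {3},
    "index": {1},
}


def boolean_and(*terms):
    """Documents that contain every one of the given terms."""
    sets = [POSTINGS.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(*terms):
    """Documents that contain at least one of the given terms."""
    return set().union(*(POSTINGS.get(t.lower(), set()) for t in terms))


all_three = boolean_and("information", "system", "index")
system_and_index = boolean_and("system", "index")
system_or_index = boolean_or("system", "index")
```

AND is an intersection of posting sets and OR is a union, so a term missing from any one document removes that document from an AND result.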
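f. Vector model using cosine similarity
Query Q = (information, system, index)
Each document is represented by its term frequencies for (information, system, index), taken from the
dictionary in part c, and scored against Q with

$$\sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert}$$

Doc 1
D1 = <3, 0, 1> (information occurs three times, index once)
Q = <1, 1, 1>

$$\sigma(D_1, Q) = \frac{3 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{3^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73$$

Doc 2
D2 = <2, 0, 0> (information occurs twice)
Q = <1, 1, 1>

$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$

Doc 3
D3 = <1, 1, 0> (information and system occur once each)
Q = <1, 1, 1>

$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$

Cosine similarity cannot exceed 1. Ranked by decreasing similarity, the order of retrieval by a search
engine is Doc 3, Doc 1, then Doc 2.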
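The cosine scores can be checked with a short Python sketch. It assumes raw term frequencies for (information, system, index) read from the dictionary in part c, with no idf weighting; the names `DOC_VECTORS`, `QUERY` and `cosine` are illustrative:

```python
import math


def cosine(d, q):
    """Cosine similarity between two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    # A zero vector has no direction; treat its similarity as 0.
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Term order: (information, system, index); counts taken from part c.
DOC_VECTORS = {1: (3, 0, 1), 2: (2, 0, 0), 3: (1, 1, 0)}
QUERY = (1, 1, 1)

scores = {doc_id: cosine(vec, QUERY) for doc_id, vec in DOC_VECTORS.items()}
# Document IDs from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the scores are about 0.73, 0.58 and 0.82 for Docs 1, 2 and 3, so `ranking` is `[3, 1, 2]`.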
Question 2 IR evaluation
a. Target and designed queries
Search engines
Google
Ask.com
Target
Target 1: obtain the course information for S779
Search queries
Query 1 = S779 course information
Query 2 = S779 course manual
b. target, results and designed search queries
A. Google Search engine
Figure 1: Google Search Engine
B. Ask.com
Figure 2: Ask.com search engine
C. Average comparison
Figure 3: average comparison for both search engines
The line graph shows that Google outperforms Ask.com: averaged over the results of both queries, it
achieves higher recall and higher precision. Precision is the fraction of the results returned by the
search engine that are relevant. Recall is the fraction of all relevant results that the search engine
returns. Google has the higher recall for both queries, and its precision is also higher than that of
Ask.com because more of the results it returns are related to the queries. Overall, Google is therefore
the stronger search engine for this target.
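Precision and recall for a single query can be computed as in the following Python sketch. The result IDs are hypothetical placeholders and are not the judgments behind Figures 1 to 3; the function name `precision_recall` is also an assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the results the search engine returned.
    relevant:  IDs of all results judged relevant to the target.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of returned results that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant results that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 results returned, 4 relevant overall, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4, 5], relevant=[2, 4, 6, 8])
```

In this made-up example precision is 0.4 and recall is 0.5. Averaging such values over Query 1 and Query 2 gives one per-engine figure to compare.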
Bibliography
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge
University Press.