Deakin University SIT773 Project: Inverted Index and Search Engines

Added on 2020/05/16

AI Summary

This project delves into the concepts of information retrieval, focusing on the creation of an inverted index from three documents related to science, computer vision, and artificial intelligence. It details the process of stop word elimination and the application of the Porter stemming algorithm to normalize the text. The project then constructs the inverted index, listing normalized tokens and their corresponding document IDs. The analysis extends to Boolean and vector queries, demonstrating their application and comparing their functionalities. The project also evaluates the performance of Google and Yahoo search engines using specific queries, assessing their precision and recall. The evaluation includes the comparison of the search engines' average performance, concluding with a bibliography of the resources used.

COVER PAGE

Contents
COVER PAGE
Question 1
1 creating inverted index
Inverted index
2 Boolean and vector queries
Question 2
Bibliography

Question 1
1 creating inverted index
The three documents are:
Science (DOC 1)
Science is a systematic enterprise that builds and organizes knowledge in the form of testable
explanations and predictions about the universe
Computer vision (DOC 2)
Computer vision is a field of computer science that works on enabling computers to see, identify and
process images in the same way that human vision does.
Artificial intelligence (DOC 3)
Artificial Intelligence is a field that has a long history but is still constantly and actively growing and
changing
a. Elimination of stop words
Science (DOC 1)
Science systematic enterprise builds organizes knowledge testable explanations predictions universe
Computer vision (DOC 2)
Computer vision field computer science works enabling computers identify process images human
vision
Artificial intelligence (DOC 3)
Artificial Intelligence field long history constantly actively growing changing
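The stop-word elimination above can be sketched in a few lines of Python. This is only an illustration: the stop-word list `STOP_WORDS` is a hypothetical one, chosen so that the output matches the filtered documents shown above (it includes words such as "form", "see" and "way" that a standard list may not contain), and the function name `remove_stop_words` is likewise an assumption.

```python
import re

# Hypothetical stop-word list, chosen only so that the output matches the
# filtered documents shown above; it is not a standard list.
STOP_WORDS = {
    "a", "about", "and", "but", "does", "form", "has", "in", "is", "of",
    "on", "same", "see", "still", "that", "the", "to", "way",
}

DOCS = {
    1: "Science is a systematic enterprise that builds and organizes "
       "knowledge in the form of testable explanations and predictions "
       "about the universe",
    2: "Computer vision is a field of computer science that works on "
       "enabling computers to see, identify and process images in the "
       "same way that human vision does.",
    3: "Artificial Intelligence is a field that has a long history but is "
       "still constantly and actively growing and changing",
}


def remove_stop_words(text):
    """Lower-case the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


for doc_id, text in DOCS.items():
    print(doc_id, " ".join(remove_stop_words(text)))
```

The tokens come out lower-cased, which also avoids treating "Vision" and "vision" as different terms later on.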
After applying the Porter stemming algorithm, the documents become:
Science (Doc1)
Science system enterprise build organize knowledge test explain predict universe
Computer Vision (Doc2)
Compute vision field compute science work enable compute identify process image human vision
Artificial Intelligence (Doc3)
Artificial Intelligence field long history constant active grow change

Inverted index
Step 1: List normalized tokens for each document
Term Doc ID
Science 1
System 1
Enterprise 1
Build 1
Organize 1
Knowledge 1
Test 1
Explain 1
Predict 1
universe 1
Compute 2
Vision 2
Field 2
Compute 2
Science 2
Work 2
Enable 2
Compute 2
Identify 2
Process 2
Image 2
Human 2
vision 2
Artificial 3
Intelligence 3
Field 3
Long 3
History 3
Constant 3
Active 3
Grow 3
change 3

Step 2: Sort the terms alphabetically
Term Doc ID
Active 3
Artificial 3
Build 1
change 3
Compute 2
Compute 2
Compute 2
Constant 3
Enable 2
Enterprise 1
Explain 1
Field 2
Field 3
Grow 3
History 3
Human 2
Identify 2
Image 2
Intelligence 3
Knowledge 1
Long 3
Organize 1
Predict 1
Process 2
Science 1
Science 2
System 1
Test 1
universe 1
Vision 2
vision 2
Work 2

Step 3: Merge multiple occurrences of the same term
Term Freq Doc ID
Active 1 3
Artificial 1 3
Build 1 1
change 1 3
Compute 3 2
Constant 1 3
Enable 1 2
Enterprise 1 1
Explain 1 1
Field 1 2
Field 1 3
Grow 1 3
History 1 3
Human 1 2
Identify 1 2
Image 1 2
Intelligence 1 3
Knowledge 1 1
Long 1 3
Organize 1 1
Predict 1 1
Process 1 2
Science 1 1
Science 1 2
System 1 1
Test 1 1
universe 1 1
Vision 2 2
Work 1 2
a. Create dictionary and related posting file
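Steps 1 to 3 and the dictionary/postings split can be sketched as follows. This is a minimal sketch, not the project's own implementation: the names `NORMALIZED`, `build_index`, `dictionary` and `postings_file` are assumptions, the token lists are copied from the normalized documents above, and terms are lower-cased so that "Vision" and "vision" merge into one entry.

```python
from collections import Counter, defaultdict

# Normalized tokens per document, copied from the stemmed text above
# (lower-cased so that "Vision" and "vision" merge into one term).
NORMALIZED = {
    1: "science system enterprise build organize knowledge test explain "
       "predict universe",
    2: "compute vision field compute science work enable compute identify "
       "process image human vision",
    3: "artificial intelligence field long history constant active grow "
       "change",
}


def build_index(docs):
    """Return {term: {doc_id: term_frequency}} for all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Counter merges repeated occurrences of a term within one document.
        for term, freq in Counter(text.split()).items():
            index[term][doc_id] = freq
    return dict(index)


index = build_index(NORMALIZED)

# Dictionary: term -> number of documents containing it (sorted by term).
dictionary = {term: len(postings) for term, postings in sorted(index.items())}
# Postings file: term -> list of (doc_id, term_frequency) pairs.
postings_file = {term: sorted(index[term].items()) for term in dictionary}

for term in dictionary:
    print(f"{term:<13}{dictionary[term]:<3}{postings_file[term]}")
```

Keeping the dictionary (terms and document frequencies) separate from the postings (document IDs and term frequencies) mirrors the usual layout in which the small dictionary is held in memory and the postings are read on demand.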

b. Testing
After testing some keywords from the inverted index with the Google search engine, the results
returned were precise and related to the three documents.
2 Boolean and vector queries
a. Boolean queries
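i) (field ∧ artificial ∧ ¬build)
This query returns document 3 only: "field" occurs in documents 2 and 3, "artificial" occurs only in
document 3, and document 3 does not contain "build".
ii) ((field ∨ feild) ∧ science)
This query returns document 2 only: the misspelling "feild" matches nothing, "field" occurs in
documents 2 and 3, and "science" occurs in documents 1 and 2.
iii) ((intelligence ∨ inteligence) ∧ (science ∨ sceince) ∧ field)
This query returns no documents: "intelligence" occurs only in document 3, which does not contain
"science".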
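The three Boolean queries can be checked mechanically by treating each posting list as a set: AND becomes intersection, OR becomes union, and NOT becomes the complement against the set of all documents. The sketch below assumes the posting lists from the merged index in Step 3; the names `POSTINGS`, `ALL_DOCS` and `docs` are hypothetical.

```python
# Posting lists for the query terms, taken from the merged index in Step 3.
# Misspelled terms ("feild", "inteligence", "sceince") are absent from the
# index, so they map to the empty set.
ALL_DOCS = {1, 2, 3}
POSTINGS = {
    "field": {2, 3},
    "artificial": {3},
    "build": {1},
    "science": {1, 2},
    "intelligence": {3},
}


def docs(term):
    """Posting list of a term as a set; empty if the term is not indexed."""
    return POSTINGS.get(term, set())


# AND -> intersection (&), OR -> union (|), NOT -> complement in ALL_DOCS.
q1 = docs("field") & docs("artificial") & (ALL_DOCS - docs("build"))
q2 = (docs("field") | docs("feild")) & docs("science")
q3 = ((docs("intelligence") | docs("inteligence"))
      & (docs("science") | docs("sceince"))
      & docs("field"))

print(sorted(q1), sorted(q2), sorted(q3))  # [3] [2] []
```

Under these posting lists, query i) matches document 3, query ii) matches document 2, and query iii) matches no document.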
b. Vector model using cosine similarity
Given the query (science, field, intelligence, intelligence), the vector space has three
dimensions: (science, field, intelligence). Each document vector holds the term frequencies from
the merged index in Step 3, and the query vector is Q = <1,1,2>. The similarity is the cosine of
the angle between the document vector and the query vector:

$$\sigma(D, Q) = \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert}$$

For D1
D1 = <1,0,0>
Q = <1,1,2>

$$\sigma(D_1, Q) = \frac{1 \times 1 + 0 \times 1 + 0 \times 2}{\sqrt{1^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{1}{\sqrt{1}\,\sqrt{6}} \approx 0.41$$

For D2
D2 = <1,1,0> ("science" and "field" each occur once in document 2)
Q = <1,1,2>

$$\sigma(D_2, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 2}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{2}{\sqrt{2}\,\sqrt{6}} \approx 0.58$$

For D3
D3 = <0,1,1> ("field" and "intelligence" each occur once in document 3)
Q = <1,1,2>

$$\sigma(D_3, Q) = \frac{0 \times 1 + 1 \times 1 + 1 \times 2}{\sqrt{0^2 + 1^2 + 1^2}\,\sqrt{1^2 + 1^2 + 2^2}} = \frac{3}{\sqrt{2}\,\sqrt{6}} \approx 0.87$$

According to the similarity of each document with the query, DOC3 has the highest similarity,
followed by DOC2, while DOC1 has the lowest. This means that DOC3 would appear as the top
search result, followed by DOC2 and then DOC1.
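The cosine similarities can be recomputed with a short script. This is a sketch under one stated assumption: the document vectors use the term frequencies from the merged index in Step 3, in the dimension order (science, field, intelligence). The names `cosine`, `QUERY` and `DOC_VECTORS` are hypothetical.

```python
import math


def cosine(d, q):
    """Cosine similarity between two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Dimensions: (science, field, intelligence).
QUERY = (1, 1, 2)
# Assumption: document vectors use the term frequencies from the Step 3 table.
DOC_VECTORS = {1: (1, 0, 0), 2: (1, 1, 0), 3: (0, 1, 1)}

scores = {doc_id: cosine(vec, QUERY) for doc_id, vec in DOC_VECTORS.items()}
# Rank documents from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(f"DOC{doc_id}: {scores[doc_id]:.2f}")
```

With these vectors the script ranks DOC3 first (about 0.87), then DOC2 (about 0.58), then DOC1 (about 0.41).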
The difference between the Boolean model and the vector model is that the Boolean model shows
which documents will appear in the search results but does not show the order in which they
will appear. The vector model shows the order in which the documents will appear when the
search query is run, based on the similarity of each document with the query.

Question 2
a. Target and designed queries
My two search engines are Google and Yahoo.
My target is Target 3: obtain the unit guide of SIT773.
Designed search queries
Query 1 = SIT773 unit guide
Query 2 = SIT773 (unit,course) guide
Expressed as Boolean queries, these become
Query 1 = (SIT773 ∧ unit ∧ guide)
Query 2 = (SIT773 ∧ (unit ∧ course) ∧ guide)
Google search engine precision vs recall for query 1, query 2 and the average
Figure 1: Google Search Engine

b. Yahoo search engine
Figure 2: Yahoo search engine
c. Average for Google and Yahoo
Figure 3: Comparison by average
Evaluation
According to the chart shown in Figure 3, Google is superior to Yahoo: it is more precise and it has
a higher recall value. Google's fraction of relevant results among the retrieved results (precision)
is higher than that of Yahoo. The same applies to recall, the fraction of all relevant results or
documents that were actually retrieved, which is also higher for Google than for Yahoo.
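The two measures used in this evaluation can be written out as a small function. The result IDs below are purely hypothetical placeholders chosen to illustrate the arithmetic; they are not the measurements behind Figures 1 to 3, and the function name `precision_recall` is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved results / all retrieved results
    recall    = relevant retrieved results / all relevant results
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 results returned, 4 relevant results exist in total,
# and 2 of the returned results are relevant.
retrieved = ["r1", "r2", "r3", "r4", "r5"]
relevant = {"r1", "r3", "r6", "r7"}

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.4 0.5
```

Averaging the per-query values for each engine would give the kind of comparison that Figure 3 describes.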
Bibliography
1&1, 2017. Information Retrieval: the Great Search for Knowledge. 1&1 Digital guide. Available at:
https://www.1and1.com/digitalguide/online-marketing/search-engine-marketing/information-retrieval-how-search-engines-retrieve-data/ [Accessed February 2, 2018].