SIT772: Database and Information Retrieval Techniques Project Analysis


Added on  2023/04/21

AI Summary
This project report details the implementation and evaluation of information retrieval techniques. The first part focuses on creating an inverted index from three documents, including stop word removal and Porter stemming. The documents are then merged, and words are sorted alphabetically. The report calculates word frequencies per document, constructs a posting file, and tests it using Boolean and vector queries. The second part evaluates search engines (Google and Yahoo) using designed search queries and analyzes their performance based on precision and recall metrics. The report includes interpolation precision graphs and average precision calculations to compare the effectiveness of the search engines. Finally, the report concludes with a comparison of Google and Yahoo, highlighting their relative performance based on the obtained results and provides a bibliography of cited sources.
COVER PAGE (ENTER YOUR DETAILS)
Contents
COVER PAGE (ENTER YOUR DETAILS)
Question 1
Creating an inverted index
Remove stop words
Porter Stemming algorithm
The three documents merged
Sorting words in alphabetical order for all documents
Word frequency per document
Posting file
Testing
Boolean and vector queries
Question 2 IR evaluation
Search engines
Targets
Designed Search queries
Google
Yahoo
Average
Bibliography
Question 1
Creating an inverted index
• Document 1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
• Document 2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
• Document 3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
Remove stop words
Results
• Document 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
• Document 2
Information retrieval finding material unstructured nature satisfies information within large
collections
• Document 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
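The stop word removal step can be sketched in Python as follows. This is a minimal illustration, not the procedure the report necessarily used: the `STOP_WORDS` list and the `remove_stop_words` helper are assumptions, and "need" is included in the list only because the results above drop it.

```python
import re

# Hypothetical stop word list: the report does not say which list was used.
# "need" is included only so the output matches the results shown above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need", "of",
    "on", "or", "other", "that", "the", "to", "use",
}


def remove_stop_words(text: str) -> str:
    """Drop stop words while keeping the original word order and casing."""
    # Keep hyphenated words such as "full-text" as single tokens.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    kept = [token for token in tokens if token.lower() not in STOP_WORDS]
    return " ".join(kept)


document_2 = (
    "Information retrieval is finding material of an unstructured nature "
    "that satisfies an information need from within large collections"
)
print(remove_stop_words(document_2))
```

Matching is done on the lowercased token so that "Is" and "is" are treated alike, while the surviving words keep their original capitalisation.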
Porter Stemming algorithm
Results
• Document 1
Informat retriev activ obtain inform resourc relev inform collect inform resourc Search base full
text content base index
• Document 2
Informat retriev find materi unstructur natur satisfi inform within larg collect
• Document 3
Informat system studi complementari network hardwar softwar peopl organ collect filter
process creat distribut data
The three documents merged
Term Document
Informat 1
retriev 1
activ 1
Obtain 1
Inform 1
Resourc 1
Relev 1
Inform 1
Resourc 1
Search 1
Base 1
Full 1
Text 1
Content 1
Base 1
index 1
Informat 2
Retriev 2
Find 2
Materi 2
Unstructur 2
Natur 2
Satisfi 2
Inform 2
Within 2
Larg 2
collect 2
Inform 3
System 3
Studi 3
Complementari 3
Network 3
Hardwar 3
Softwar 3
Peopl 3
Organ 3
collect 3
Filter 3
Process 3
creat 3
Distribut 3
data 3
Sorting words in alphabetical order for all documents
Term Document
activ 1
Base 1
Base 1
collect 2
collect 3
Complementari 3
Content 1
creat 3
data 3
Distribut 3
Filter 3
Find 2
Full 1
Hardwar 3
index 1
Inform 1
Inform 1
Inform 2
Inform 3
Informat 3
Larg 2
Materi 2
Natur 2
Network 3
Obtain 1
Organ 3
Peopl 3
Process 3
Relev 1
Resourc 1
Resourc 1
retriev 1
Retriev 2
Satisfi 2
Search 1
Softwar 3
Studi 3
System 3
Text 1
Unstructur 2
Within 2
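The merge and sort steps can be sketched in Python as below. The two tiny token lists are hypothetical and are not the report's documents; they only show how (term, document) pairs are produced and ordered.

```python
# Hypothetical stemmed tokens for two tiny documents (not the report's data).
stemmed_docs = {
    1: ["inform", "retriev", "index", "inform"],
    2: ["inform", "system", "data"],
}

# Merge: one (term, document id) pair per token occurrence.
pairs = [
    (term, doc_id)
    for doc_id, terms in stemmed_docs.items()
    for term in terms
]

# Sort by term first, then by document id. Lowercasing the sort key makes
# the order independent of capitalisation.
sorted_pairs = sorted(pairs, key=lambda pair: (pair[0].lower(), pair[1]))

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```

Duplicate pairs are kept on purpose at this stage, because the next step counts them to obtain the frequency of each term in each document.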
Word frequency per document
Term Frequency Document
activ 1 1
Base 2 1
collect 1 2
collect 1 3
Complementari 1 3
Content 1 1
creat 1 3
data 1 3
Distribut 1 3
Filter 1 3
Find 1 2
Full 1 1
Hardwar 1 3
index 1 1
Inform 2 1
Inform 1 2
Inform 2 3
Larg 1 2
Materi 1 2
Natur 1 2
Network 1 3
Obtain 1 1
Organ 1 3
Peopl 1 3
Process 1 3
Relev 1 1
Resourc 2 1
Retriev 1 1
Retriev 1 2
Satisfi 1 2
Search 1 1
Softwar 1 3
Studi 1 3
System 1 3
Text 1 1
Unstructur 1 2
Within 1 2
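Counting term frequencies per document, and grouping those counts by term in the way a posting file does, can be sketched as follows. The input pairs are hypothetical, and the layout of `postings` is one possible representation, not necessarily the one used in the report.

```python
from collections import Counter, defaultdict

# Hypothetical sorted (term, document id) pairs, not the report's data.
sorted_pairs = [
    ("data", 2), ("index", 1), ("inform", 1), ("inform", 1),
    ("inform", 2), ("retriev", 1), ("system", 2),
]

# Count how often each term occurs in each document.
frequencies = Counter(sorted_pairs)

# Group the counts by term: term -> list of (document id, frequency).
postings = defaultdict(list)
for (term, doc_id), freq in sorted(frequencies.items()):
    postings[term].append((doc_id, freq))

for term, entries in postings.items():
    # term, number of documents containing it, then its postings
    print(term, len(entries), entries)
```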
Posting file
Testing
To test the posting file, its key terms were entered into a search engine (Google), and the
retrieved documents were compared with the original documents. The documents retrieved by
Google showed some degree of similarity to the original documents.
Boolean and vector queries
a. Boolean queries
1) information AND system AND index = no document (no document contains all three terms)
2) system AND index = no document ("system" occurs only in Doc3, "index" only in Doc1)
3) information AND index = Doc1
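Under the usual reading of AND as set intersection, Boolean queries of this kind can be evaluated against a posting list as sketched below. The `postings` dictionary is an assumption read off the three original documents (every document contains "information", only Document 3 contains "systems", only Document 1 contains "indexing"), and `boolean_and` is a hypothetical helper.

```python
# Assumed postings: the documents that contain each stemmed query term,
# read off the three documents at the start of this question.
postings = {
    "inform": {1, 2, 3},
    "system": {3},
    "index": {1},
}


def boolean_and(*terms: str) -> set[int]:
    """Return the ids of the documents that contain every given term."""
    doc_sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


print(boolean_and("inform", "index"))
print(boolean_and("system", "index"))
print(boolean_and("inform", "system", "index"))
```

An OR query would use set union instead, which returns every document that contains at least one of the terms.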
b. Vector queries
Query Q= (Information, system, index)
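Document 1
D1 = <3, 1, 0>
Q = <1, 1, 1>
\[ \sigma(D_1, Q) = \frac{3 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{3^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{4}{\sqrt{10}\,\sqrt{3}} \approx 0.73 \]
Document 2
D2 = <2, 0, 0>
Q = <1, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58 \]
Document 3
D3 = <1, 1, 0>
Q = <1, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 \]
The documents are retrieved in the following order: Doc3, Doc1, then Doc2.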
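The cosine similarity used for the vector query can be checked with a short Python sketch. The document and query vectors are the ones stated above; the `cosine_similarity` helper is illustrative. For non-negative term weights the score always lies between 0 and 1.

```python
import math


def cosine_similarity(doc: list[float], query: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_doc * norm_query)


# Vectors as stated in the report, in the order (information, system, index).
query = [1, 1, 1]
documents = {"Doc1": [3, 1, 0], "Doc2": [2, 0, 0], "Doc3": [1, 1, 0]}

scores = {name: cosine_similarity(vec, query) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Sorting by descending score gives the retrieval order, so the ranking follows directly from the three similarity values.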
Question 2 IR evaluation
Search engines
• Google
• Yahoo
Targets
Target 2: Obtain the price of a new Xbox One
Designed Search queries
• Query 1 = new Xbox One price
• Query 2 = new Xbox One cost
Google
Query 1
Query 2
Each row is one ranked result; R marks a result flagged as relevant and - a result with no flag.

Query 1                           Query 2
Relevant  Precision  Recall       Relevant  Precision  Recall
R         1          0.0714       R         1          0.071
R         1          0.143        R         1          0.143
-         1          0.214        -         0.667      0.143
R         1          0.286        -         0.5        0.143
R         1          0.357        R         0.6        0.214
R         0.833      0.429        -         0.5        0.214
-         0.714      0.429        -         0.429      0.214
-         0.625      0.429        -         0.375      0.214
R         0.667      0.5          -         0.444      0.286
-         0.6        0.5          -         0.4        0.286
-         0.636      0.571        R         0.455      0.358
R         0.667      0.643        -         0.417      0.358
-         0.692      0.714        R         0.462      0.429
-         0.643      0.714        -         0.4        0.429
-         0.6        0.714        R         0.4375     0.5
-         0.5625     0.714        R         0.412      0.5
-         0.529      0.714        -         0.389      0.5
-         0.556      0.786        -         0.368      0.5
R         0.526      0.786        -         0.4        0.571
R         0.55       0.857
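Precision and recall at each rank can be computed from a list of relevance judgements, as in the sketch below. The judgements are hypothetical, and the total of 14 relevant documents is an assumption inferred from the first recall value in the table (1/14 ≈ 0.0714); the report does not state it explicitly.

```python
def precision_recall_at_ranks(
    relevance: list[bool], total_relevant: int
) -> list[tuple[float, float]]:
    """Return (precision, recall) after each ranked result."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        # precision = relevant so far / results seen so far
        # recall    = relevant so far / all relevant documents
        rows.append((hits / rank, hits / total_relevant))
    return rows


# Hypothetical judgements for the first five results (True = relevant).
judgements = [True, True, False, True, False]
rows = precision_recall_at_ranks(judgements, total_relevant=14)

for rank, (precision, recall) in enumerate(rows, start=1):
    print(rank, round(precision, 3), round(recall, 3))
```

Precision can fall when a non-relevant result is added, whereas recall never decreases as more results are examined.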
Recall   Interpolated precision (Query 1)   Interpolated precision (Query 2)   Average precision
0        1                                  1                                  1
0.1      1                                  1                                  1
0.2      1                                  0.6                                0.8
0.3      1                                  0.455                              0.7275
0.4      0.833                              0.462                              0.6475
0.5      0.667                              0.4375                             0.55225
0.6      0.667                              0.412                              0.5395
0.7      0.526                              0                                  0.263
0.8      0.55                               0                                  0.275
0.9      0                                  0                                  0
1        0                                  0                                  0
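The textbook form of 11-point interpolation takes, for each recall level, the highest precision observed at that recall or any higher one. The sketch below implements that definition on hypothetical (precision, recall) pairs; the report does not state exactly how its values were derived, so they may not match this rule in every row.

```python
def interpolated_precision(points: list[tuple[float, float]]) -> list[float]:
    """11-point interpolation over recall levels 0.0, 0.1, ..., 1.0.

    For each level, return the highest precision among all points whose
    recall is at least that level, or 0.0 if there is no such point.
    """
    levels = [i / 10 for i in range(11)]
    result = []
    for level in levels:
        # Small tolerance guards against floating-point rounding of levels.
        candidates = [p for p, r in points if r >= level - 1e-9]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical (precision, recall) pairs, one per ranked result.
points = [(1.0, 0.25), (0.5, 0.25), (0.667, 0.5), (0.5, 0.5), (0.6, 0.75)]
curve = interpolated_precision(points)

for i, value in enumerate(curve):
    print(i / 10, value)
```

By construction the resulting curve never increases as recall grows, which is what makes curves from different queries comparable at fixed recall levels.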
Figure 1: Google Search Engine
Yahoo
Query 1
Query 2
Each row is one ranked result; R marks a result flagged as relevant and - a result with no flag.

Query 1                           Query 2
Relevant  Precision  Recall       Relevant  Precision  Recall
R         1          0.0714       R         1          0.071
R         1          0.143        R         1          0.143
-         1          0.214        -         0.667      0.143
R         1          0.286        -         0.75       0.214
R         1          0.357        R         0.8        0.289
R         0.833      0.429        -         0.667      0.289
-         0.714      0.429        -         0.571      0.289
-         0.625      0.429        R         0.625      0.357
R         0.667      0.5          R         0.667      0.429
-         0.6        0.5          -         0.6        0.429
-         0.636      0.571        -         0.636      0.5
R         0.667      0.643        R         0.667      0.571
-         0.692      0.714        R         0.615      0.571
-         0.643      0.714        -         0.571      0.571
-         0.6        0.714        R         0.6        0.643
-         0.5625     0.714        -         0.5625     0.643
-         0.529      0.714        -         0.529      0.643
-         0.556      0.786        -         0.556      0.714
R         0.526      0.786        -         0.526      0.714
R         0.55       0.857        R         0.55       0.786
Recall   Interpolated precision (Query 1)   Interpolated precision (Query 2)   Average precision
0        1                                  1                                  1
0.1      1                                  1                                  1
0.2      1                                  0.8                                0.9
0.3      1                                  0.625                              0.8125
0.4      0.833                              0.667                              0.75
0.5      0.667                              0.667                              0.667
0.6      0.667                              0.615                              0.641
0.7      0.526                              0.6                                0.563
0.8      0.55                               0.55                               0.55
0.9      0                                  0                                  0
1        0                                  0                                  0
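The average precision column is the mean of the two queries' interpolated precision at each recall level. A short Python sketch, using the Yahoo values from the table above, reproduces it; the variable names are illustrative.

```python
# Interpolated precision at recall 0.0, 0.1, ..., 1.0, copied from the
# Yahoo table above.
query_1 = [1, 1, 1, 1, 0.833, 0.667, 0.667, 0.526, 0.55, 0, 0]
query_2 = [1, 1, 0.8, 0.625, 0.667, 0.667, 0.615, 0.6, 0.55, 0, 0]

# Average the two queries level by level.
average = [(a + b) / 2 for a, b in zip(query_1, query_2)]

for i, value in enumerate(average):
    print(i / 10, round(value, 4))
```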
Figure 2: Yahoo search engine
Average
Recall   Google average precision   Yahoo average precision
0        1                          1
0.1      1                          1
0.2      0.8                        0.9
0.3      0.7275                     0.8125
0.4      0.6475                     0.75
0.5      0.55225                    0.667
0.6      0.5395                     0.641
0.7      0.263                      0.563
0.8      0.275                      0.55
0.9      0                          0
1        0                          0
Figure 3: Comparison by average
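One way to compare the two curves numerically is to check them level by level and to summarise each with its mean over the 11 recall levels. The sketch below uses the average precision values from the table above; summarising with the 11-point mean is an assumption of this example, not a step described in the report.

```python
# Average precision at recall 0.0, 0.1, ..., 1.0, copied from the table above.
google = [1, 1, 0.8, 0.7275, 0.6475, 0.55225, 0.5395, 0.263, 0.275, 0, 0]
yahoo = [1, 1, 0.9, 0.8125, 0.75, 0.667, 0.641, 0.563, 0.55, 0, 0]

# Is Yahoo at least as precise as Google at every recall level?
yahoo_never_worse = all(y >= g for g, y in zip(google, yahoo))

# 11-point mean as a single summary number per engine.
google_mean = sum(google) / len(google)
yahoo_mean = sum(yahoo) / len(yahoo)

print(yahoo_never_worse, round(google_mean, 3), round(yahoo_mean, 3))
```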
According to Figure 3, which compares Google and Yahoo, Yahoo performs better than Google on
these queries: its average precision is equal to or higher than Google's at every recall level.
Bibliography
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 27 May 2018].