SIT772: Database and Information Retrieval Techniques Project Analysis
Added on 2023/04/21
AI Summary
This project report details the implementation and evaluation of information retrieval techniques. The first part focuses on creating an inverted index from three documents, including stop word removal and Porter stemming. The documents are then merged, and words are sorted alphabetically. The report calculates word frequencies per document, constructs a posting file, and tests it using Boolean and vector queries. The second part evaluates search engines (Google and Yahoo) using designed search queries and analyzes their performance based on precision and recall metrics. The report includes interpolation precision graphs and average precision calculations to compare the effectiveness of the search engines. Finally, the report concludes with a comparison of Google and Yahoo, highlighting their relative performance based on the obtained results and provides a bibliography of cited sources.

COVER PAGE (ENTER YOUR DETAILS)

Contents
COVER PAGE (ENTER YOUR DETAILS)
Question 1
Creating an inverted index
Remove stop words
Porter Stemming algorithm
The three documents merged
Sorting words in alphabetical order for all documents
Word frequency per document
Posting file
Testing
Boolean and vector queries
Question 2 IR evaluation
Search engines
Targets
Designed Search queries
Google
Yahoo
Average
Bibliography

Question 1
Creating an inverted index
• Document 1
Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text
or other content-based indexing.
• Document 2
Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
• Document 3
Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
Remove stop words
Results
• Document 1
Information retrieval activity obtaining information resources relevant information collection
information resources Searches based full-text content-based indexing
• Document 2
Information retrieval finding material unstructured nature satisfies information within large
collections
• Document 3
Information systems study complementary networks hardware software people organizations
collect filter process create distribute data
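The stop word removal step can be sketched as follows. This is a minimal illustration, not the tool used in the report: the stop word list is an assumption reconstructed from the words that are missing in the results above, and the function name `remove_stop_words` is hypothetical.

```python
import re

# Assumed stop word list, reconstructed from the words dropped in the results above.
STOP_WORDS = {
    "is", "the", "of", "to", "an", "a", "need", "from", "can", "be",
    "on", "or", "other", "that", "and", "use",
}

def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    # Hyphenated words such as "full-text" are kept as single tokens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections")
filtered = remove_stop_words(doc2)
print(" ".join(filtered))
```

Unlike the listing above, this sketch lowercases every token, so that "Information" and "information" are treated as the same word in later steps.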
Porter Stemming algorithm
Results
• Document 1
inform retriev activ obtain inform resourc relev inform collect inform resourc search base full
text content base index
• Document 2
inform retriev find materi unstructur natur satisfi inform within larg collect
• Document 3
inform system studi complementari network hardwar softwar peopl organ collect filter
process creat distribut data

The three documents merged
Term Document
inform 1
retriev 1
activ 1
obtain 1
inform 1
resourc 1
relev 1
inform 1
collect 1
inform 1
resourc 1
search 1
base 1
full 1
text 1
content 1
base 1
index 1
inform 2
retriev 2
find 2
materi 2
unstructur 2
natur 2
satisfi 2
inform 2
within 2
larg 2
collect 2
inform 3
system 3
studi 3
complementari 3
network 3
hardwar 3
softwar 3
peopl 3
organ 3
collect 3
filter 3
process 3
creat 3
distribut 3
data 3

Sorting words in alphabetical order for all documents
Term Document
activ 1
base 1
base 1
collect 1
collect 2
collect 3
complementari 3
content 1
creat 3
data 3
distribut 3
filter 3
find 2
full 1
hardwar 3
index 1
inform 1
inform 1
inform 1
inform 1
inform 2
inform 2
inform 3
larg 2
materi 2
natur 2
network 3
obtain 1
organ 3
peopl 3
process 3
relev 1
resourc 1
resourc 1
retriev 1
retriev 2
satisfi 2
search 1
softwar 3
studi 3
system 3
text 1
unstructur 2
within 2
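The merge and sort steps can be sketched as below. The stemmed token strings are taken from the three documents after stop word removal and Porter stemming, lowercased; the variable names `STEMMED` and `pairs` are illustrative only.

```python
# Stemmed, lowercased tokens of the three documents (after stop word removal).
STEMMED = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform within larg collect",
    3: "inform system studi complementari network hardwar softwar peopl organ "
       "collect filter process creat distribut data",
}

# Merge: one (term, document) pair per token occurrence.
pairs = [(term, doc_id) for doc_id, text in STEMMED.items() for term in text.split()]

# Sort alphabetically by term, then by document number.
pairs.sort()

for term, doc_id in pairs[:6]:
    print(term, doc_id)
```

Sorting tuples orders the pairs by term first and by document number second, which places all occurrences of a term next to each other and makes the frequency count in the next step a simple grouping.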
Word frequency per document
Term Frequency Document
activ 1 1
base 2 1
collect 1 1
collect 1 2
collect 1 3
complementari 1 3
content 1 1
creat 1 3
data 1 3
distribut 1 3
filter 1 3
find 1 2
full 1 1
hardwar 1 3
index 1 1
inform 4 1
inform 2 2
inform 1 3
larg 1 2
materi 1 2
natur 1 2
network 1 3
obtain 1 1
organ 1 3
peopl 1 3
process 1 3
relev 1 1
resourc 2 1
retriev 1 1
retriev 1 2
satisfi 1 2
search 1 1
softwar 1 3
studi 1 3
system 1 3
text 1 1
unstructur 1 2
within 1 2
Posting file
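A posting file of this kind could be built as in the sketch below, which maps each term to a list of (document, frequency) entries. The layout is an assumption for illustration; the stemmed token strings come from the three documents, lowercased.

```python
from collections import Counter

# Stemmed, lowercased tokens of the three documents (after stop word removal).
STEMMED = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform within larg collect",
    3: "inform system studi complementari network hardwar softwar peopl organ "
       "collect filter process creat distribut data",
}

# term -> list of (document number, term frequency in that document)
postings = {}
for doc_id, text in STEMMED.items():
    for term, freq in Counter(text.split()).items():
        postings.setdefault(term, []).append((doc_id, freq))

# Print the dictionary in alphabetical order with each term's document count.
for term in sorted(postings):
    print(term, len(postings[term]), postings[term])
```

Each dictionary entry holds the term, the number of documents containing it, and the postings themselves, so a query only needs to read the lists of the terms it mentions.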


Testing
To test the posting file, its key terms were entered into a search engine (Google), and the
retrieved documents were compared with the three original documents. The documents retrieved
by Google showed some degree of similarity to the originals.
Boolean and vector queries
a. Boolean queries
1) information AND system AND index = no documents (no single document contains all three terms)
2) system AND index = no documents (system occurs only in Doc3, index only in Doc1)
3) information AND index = Doc1
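Boolean AND queries can be evaluated by intersecting the posting sets of the query terms. The sketch below assumes the query words are stemmed the same way as the documents (information becomes inform); the helper names `boolean_and` and `boolean_or` are hypothetical.

```python
# Stemmed, lowercased tokens of the three documents (after stop word removal).
STEMMED = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform within larg collect",
    3: "inform system studi complementari network hardwar softwar peopl organ "
       "collect filter process creat distribut data",
}

# term -> set of document numbers containing the term
postings = {}
for doc_id, text in STEMMED.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents that contain every query term (set intersection)."""
    sets = [postings.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []

def boolean_or(*terms):
    """Documents that contain at least one query term (set union)."""
    sets = [postings.get(term, set()) for term in terms]
    return sorted(set.union(*sets)) if sets else []

print(boolean_and("inform", "system", "index"))
print(boolean_and("system", "index"))
print(boolean_and("inform", "index"))
print(boolean_or("system", "index"))
```

AND narrows the result to documents that hold all terms, while OR widens it to documents that hold any of them, so the two operators can return very different sets for the same terms.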
b. Vector queries
Query Q = (information, system, index)
Each document vector holds the frequencies of the stemmed terms inform, system, and index in that document.
Document 1
D1 = <4, 0, 1>
Q = <1, 1, 1>
\[ \sigma(D_1, Q) = \frac{4 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{4^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{5}{\sqrt{17}\,\sqrt{3}} \approx 0.70 \]
Document 2
D2 = <2, 0, 0>
Q = <1, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58 \]
Document 3
D3 = <1, 1, 0>
Q = <1, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 \]

Ranked by cosine similarity, the documents are retrieved in the following order: Doc 3, Doc 1, then Doc 2.
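The cosine ranking can be checked with a short script. This is a sketch under the assumption that raw term counts from the stemmed documents serve as vector weights and that the query terms are stemmed to inform, system, and index; the function name `cosine` is illustrative.

```python
import math

# Stemmed, lowercased tokens of the three documents (after stop word removal).
STEMMED = {
    "Doc1": "inform retriev activ obtain inform resourc relev inform collect "
            "inform resourc search base full text content base index",
    "Doc2": "inform retriev find materi unstructur natur satisfi inform within larg collect",
    "Doc3": "inform system studi complementari network hardwar softwar peopl organ "
            "collect filter process creat distribut data",
}
QUERY_TERMS = ("inform", "system", "index")

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = (1, 1, 1)
# Document vectors: how often each query term occurs in the document.
vectors = {
    name: tuple(text.split().count(term) for term in QUERY_TERMS)
    for name, text in STEMMED.items()
}
scores = {name: cosine(vec, query) for name, vec in vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, vectors[name], round(scores[name], 3))
```

Because the cosine divides by both vector lengths, every score lies between 0 and 1 for non-negative counts, which is a quick sanity check on hand calculations.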
Question 2 IR evaluation
Search engines
• Google
• Yahoo
Targets
Target 2: Obtain the price of a new Xbox One
Designed Search queries
• Query 1 = new Xbox One price
• Query 2 = new Xbox One cost
Google

Precision and recall at each rank (R = result marked relevant, - = not marked)
            Query 1                       Query 2
Rank  Rel.  Precision  Recall       Rel.  Precision  Recall
1     R     1          0.0714       R     1          0.071
2     R     1          0.143        R     1          0.143
3     -     1          0.214        -     0.667      0.143
4     R     1          0.286        -     0.5        0.143
5     R     1          0.357        R     0.6        0.214
6     R     0.833      0.429        -     0.5        0.214
7     -     0.714      0.429        -     0.429      0.214
8     -     0.625      0.429        -     0.375      0.214
9     R     0.667      0.5          -     0.444      0.286
10    -     0.6        0.5          -     0.4        0.286
11    -     0.636      0.571        R     0.455      0.358
12    R     0.667      0.643        -     0.417      0.358
13    -     0.692      0.714        R     0.462      0.429
14    -     0.643      0.714        -     0.4        0.429
15    -     0.6        0.714        R     0.4375     0.5
16    -     0.5625     0.714        R     0.412      0.5
17    -     0.529      0.714        -     0.389      0.5
18    -     0.556      0.786        -     0.368      0.5
19    R     0.526      0.786        -     0.4        0.571
20    R     0.55       0.857

Interpolated precision at the standard recall levels
Recall  Query 1  Query 2  Average precision
0       1        1        1
0.1     1        1        1
0.2     1        0.6      0.8
0.3     1        0.455    0.7275
0.4     0.833    0.462    0.6475
0.5     0.667    0.4375   0.55225
0.6     0.667    0.412    0.5395
0.7     0.526    0        0.263
0.8     0.55     0        0.275
0.9     0        0        0
1       0        0        0
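For reference, precision and recall at each rank, and the 11-point interpolated precision, can be computed as in the sketch below. The relevance flags and the total number of relevant documents are made up for illustration and are not the judgments behind the table; the interpolation uses the common convention of taking the highest precision at any recall greater than or equal to the level.

```python
def precision_recall(relevance, total_relevant):
    """Return a (precision, recall) pair after each rank for a list of 0/1 flags."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        points.append((hits / rank, hits / total_relevant))
    return points

def interpolate(points, levels=11):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Highest precision reached at any recall >= this level (0 if none).
        candidates = [p for p, r in points if r >= level - 1e-9]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Hypothetical ranked result list: 1 = relevant, 0 = not relevant.
flags = [1, 1, 0, 1, 0]
points = precision_recall(flags, total_relevant=4)
curve = interpolate(points)

for rank, (p, r) in enumerate(points, start=1):
    print(rank, round(p, 3), round(r, 3))
print(curve)
```

Under this convention the interpolated curve never rises as recall increases, and it drops to 0 at recall levels the result list never reaches.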

Figure 1: Google Search Engine

Yahoo

Precision and recall at each rank (R = result marked relevant, - = not marked)
            Query 1                       Query 2
Rank  Rel.  Precision  Recall       Rel.  Precision  Recall
1     R     1          0.0714       R     1          0.071
2     R     1          0.143        R     1          0.143
3     -     1          0.214        -     0.667      0.143
4     R     1          0.286        -     0.75       0.214
5     R     1          0.357        R     0.8        0.289
6     R     0.833      0.429        -     0.667      0.289
7     -     0.714      0.429        -     0.571      0.289
8     -     0.625      0.429        R     0.625      0.357
9     R     0.667      0.5          R     0.667      0.429
10    -     0.6        0.5          -     0.6        0.429
11    -     0.636      0.571        -     0.636      0.5
12    R     0.667      0.643        R     0.667      0.571
13    -     0.692      0.714        R     0.615      0.571
14    -     0.643      0.714        -     0.571      0.571
15    -     0.6        0.714        R     0.6        0.643
16    -     0.5625     0.714        -     0.5625     0.643
17    -     0.529      0.714        -     0.529      0.643
18    -     0.556      0.786        -     0.556      0.714
19    R     0.526      0.786        -     0.526      0.714
20    R     0.55       0.857        R     0.55       0.786

Interpolated precision at the standard recall levels
Recall  Query 1  Query 2  Average precision
0       1        1        1
0.1     1        1        1
0.2     1        0.8      0.9
0.3     1        0.625    0.8125
0.4     0.833    0.667    0.75
0.5     0.667    0.667    0.667
0.6     0.667    0.615    0.641
0.7     0.526    0.6      0.563
0.8     0.55     0.55     0.55
0.9     0        0        0
1       0        0        0
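The average precision column is the mean of the two queries' interpolated precision at each recall level. The short sketch below reproduces it from the Yahoo values in the table; the list names are illustrative.

```python
# Interpolated precision per recall level (0, 0.1, ..., 1) for Yahoo, from the table.
query1 = [1, 1, 1, 1, 0.833, 0.667, 0.667, 0.526, 0.55, 0, 0]
query2 = [1, 1, 0.8, 0.625, 0.667, 0.667, 0.615, 0.6, 0.55, 0, 0]

# Average the two queries level by level.
average = [(p1 + p2) / 2 for p1, p2 in zip(query1, query2)]

for i, value in enumerate(average):
    print(i / 10, round(value, 4))
```

Averaging per recall level, rather than per rank, lets queries with different numbers of relevant documents be combined on a common scale.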

Figure 2: Yahoo search engine

Average
Average interpolated precision per recall level
Recall  Google   Yahoo
0       1        1
0.1     1        1
0.2     0.8      0.9
0.3     0.7275   0.8125
0.4     0.6475   0.75
0.5     0.55225  0.667
0.6     0.5395   0.641
0.7     0.263    0.563
0.8     0.275    0.55
0.9     0        0
1       0        0
Figure 3: Comparison by average

According to the comparison in Figure 3, Yahoo performs better than Google on these two queries:
its average interpolated precision is equal to or higher than Google's at every recall level.
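The comparison can be verified directly from the averaged values in the table above. The sketch below checks each recall level and also computes the mean over the 11 levels as a single summary number; that summary figure is an addition for illustration and is not reported in the original results.

```python
# Average interpolated precision per recall level (0, 0.1, ..., 1), from the table.
google = [1, 1, 0.8, 0.7275, 0.6475, 0.55225, 0.5395, 0.263, 0.275, 0, 0]
yahoo = [1, 1, 0.9, 0.8125, 0.75, 0.667, 0.641, 0.563, 0.55, 0, 0]

# Level-by-level comparison: is Yahoo at least as precise as Google everywhere?
yahoo_never_worse = all(y >= g for g, y in zip(google, yahoo))

# Single-number summary: mean of the 11 interpolated precision values.
mean_google = sum(google) / len(google)
mean_yahoo = sum(yahoo) / len(yahoo)

print("Yahoo >= Google at every recall level:", yahoo_never_worse)
print("Mean over 11 levels, Google:", round(mean_google, 3))
print("Mean over 11 levels, Yahoo:", round(mean_yahoo, 3))
```

With only two queries per engine, the difference is an observation about this small sample rather than a general statement about either search engine.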
Bibliography
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 27 May 2018].