SIT772: Database and Information Retrieval Search Algorithm Report

Added on 2023/06/04

AI Summary

This report presents a comprehensive solution to a SIT772 assignment on database and information retrieval, focusing on the design and evaluation of search algorithms. The assignment begins with the creation of an inverted index, detailing the steps of removing stop words, applying the Porter stemming algorithm, and merging document tables. It includes sorting words alphabetically, calculating within-document frequencies, constructing a dictionary and posting file, and testing the index. The report then explores Boolean and vector queries, providing examples and calculations for cosine similarity. Furthermore, the assignment evaluates information retrieval (IR) using Google and Bing search engines. The evaluation includes target identification, query design, and graphical comparisons of recall and precision to determine the superior search engine. The report concludes with references to support the analysis.

COVER PAGE (ENTER YOUR DETAILS)

Contents
COVER PAGE (ENTER YOUR DETAILS)
Question 1
Creating an inverted index
a. Removing stop words
Applying Porter Stemming algorithm
• Merged document table
• Sorting words in alphabetical order for all documents
• Within document frequencies
• Dictionary and related posting file
• Testing
Boolean and vector queries
Question 2 IR evaluation
Search engines
Targets
Search queries
Google search Engine graph
Bing Search Engine graph
Average Google and Bing graph
References

Question 1
Creating an inverted index
The first step is to choose the documents from which the inverted index will be built. The three
documents used are:
• Document 1 (Search engine)
Google ranking systems sort through hundreds of billions of webpages in the Search index to
give you useful and relevant results in a fraction of a second.
• Document 2 (Database)
Database security concerns the use of a broad range of information security controls to protect
databases.
• Document 3 (Security and privacy)
Data breaches are on the rise, making information security and privacy top priorities for
business and IT leaders.
a. Removing stop words
After removing stop words, the documents become:
• Document 1 (Search engine)
Google ranking systems sort hundreds billions webpages Search index useful relevant results
fraction second
• Document 2 (Database)
Database security concerns broad range information security controls protect databases
• Document 3 (Security and privacy)
Data breaches rise making information security privacy top priorities business IT leaders
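The filtering step above can be sketched in Python. The stop list here is an assumption: it is inferred from the words that disappear between the original and the filtered documents in this report, and is not a standard list. The names STOP_WORDS and remove_stop_words are illustrative only.

```python
import re

# Assumed stop list, inferred from the words dropped in this report.
STOP_WORDS = {"a", "and", "are", "for", "give", "in", "of", "on",
              "the", "through", "to", "use", "you"}

def remove_stop_words(text):
    """Split text into alphabetic tokens and drop those in STOP_WORDS."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

doc3 = ("Data breaches are on the rise, making information security and "
        "privacy top priorities for business and IT leaders")
print(" ".join(remove_stop_words(doc3)))
```

Note that a general-purpose stop list would normally contain "it", which would also remove the term IT from Document 3; the list assumed here keeps it, as the report does.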
Applying Porter Stemming algorithm
Stemming the documents without stop words (Brasetvik, 2013) results in the following documents:
• Document 1 (Search engine)
Googl rank system sort hundr billion webpag Search index us relev result fraction second
• Document 2 (Database)
Databas secur concern broad rang inform secur control protect databas
• Document 3 (Security and privacy)
Data breach rise make inform secur privaci top prioriti busi IT leader
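To give a feel for how the stems above arise, the sketch below implements only step 1a of the Porter algorithm, which handles plural endings. It is not the full stemmer: the later steps, not shown here, strip further suffixes (for example, database becomes databas and business becomes busi). The function name porter_step_1a is illustrative.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # priorities -> prioriti
    if word.endswith("ss"):
        return word           # business is left unchanged by this step
    if word.endswith("s"):
        return word[:-1]      # leaders -> leader
    return word

for w in ["priorities", "leaders", "databases", "business"]:
    print(w, "->", porter_step_1a(w))
```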

• Merged document table
Terms are folded to lowercase so that entries such as Databas and databas are treated as the same
term.
Term        Document
googl       1
rank        1
system      1
sort        1
hundr       1
billion     1
webpag      1
search      1
index       1
us          1
relev       1
result      1
fraction    1
second      1
databas     2
secur       2
concern     2
broad       2
rang        2
inform      2
secur       2
control     2
protect     2
databas     2
data        3
breach      3
rise        3
make        3
inform      3
secur       3
privaci     3
top         3
prioriti    3
busi        3
it          3
leader      3
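The merged table can be produced mechanically from the stemmed documents. The following is a minimal sketch, assuming the stemmed text is lowercased first; STEMMED_DOCS and merge_documents are illustrative names.

```python
# Stemmed documents from the report, lowercased (an assumption made here
# so that terms compare equal regardless of capitalisation).
STEMMED_DOCS = {
    1: "googl rank system sort hundr billion webpag search index us relev "
       "result fraction second",
    2: "databas secur concern broad rang inform secur control protect databas",
    3: "data breach rise make inform secur privaci top prioriti busi it leader",
}

def merge_documents(docs):
    """Return one (term, doc_id) pair per token, in document order."""
    return [(term, doc_id)
            for doc_id, text in sorted(docs.items())
            for term in text.split()]

pairs = merge_documents(STEMMED_DOCS)
print(len(pairs))
print(pairs[:3])
```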
• Sorting words in alphabetical order for all documents
Term        Doc ID
billion     1
breach      3
broad       2
busi        3
concern     2
control     2
data        3
databas     2
databas     2
fraction    1
googl       1
hundr       1
index       1
inform      2
inform      3
it          3
leader      3
make        3
prioriti    3
privaci     3
protect     2
rang        2
rank        1
relev       1
result      1
rise        3
search      1
second      1
secur       2
secur       2
secur       3
sort        1
system      1
top         3
us          1
webpag      1
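The sorting step can be sketched as follows, using a small subset of the (term, document) pairs for brevity. Lowercase terms are assumed, because Python would otherwise order uppercase letters before lowercase ones.

```python
# A few (term, doc_id) pairs in document order; a subset of the merged table.
pairs = [("secur", 3), ("inform", 3), ("databas", 2), ("secur", 2),
         ("inform", 2), ("billion", 1), ("secur", 2)]

# Sort by term first, then by document ID, so that equal terms become
# adjacent and their entries are already in document order.
sorted_pairs = sorted(pairs, key=lambda p: (p[0], p[1]))

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```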
• Within document frequencies
Term        Doc ID  Frequency
billion     1       1
breach      3       1
broad       2       1
busi        3       1
concern     2       1
control     2       1
data        3       1
databas     2       2
fraction    1       1
googl       1       1
hundr       1       1
index       1       1
inform      2       1
inform      3       1
it          3       1
leader      3       1
make        3       1
prioriti    3       1
privaci     3       1
protect     2       1
rang        2       1
rank        1       1
relev       1       1
result      1       1
rise        3       1
search      1       1
second      1       1
secur       2       2
secur       3       1
sort        1       1
system      1       1
top         3       1
us          1       1
webpag      1       1
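Within-document frequencies are obtained by collapsing duplicate (term, document) pairs into a count. A minimal sketch, using a subset of the sorted pairs and Python's collections.Counter:

```python
from collections import Counter

# Subset of the sorted (term, doc_id) pairs, duplicates included.
sorted_pairs = [("databas", 2), ("databas", 2), ("inform", 2), ("inform", 3),
                ("secur", 2), ("secur", 2), ("secur", 3)]

# Counter maps each distinct (term, doc_id) pair to how often it occurs.
frequencies = Counter(sorted_pairs)

for (term, doc_id), freq in sorted(frequencies.items()):
    print(term, doc_id, freq)
```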
• Dictionary and related posting file
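One possible way to derive a dictionary and posting file from the stemmed documents is sketched below. The layout is an assumption, not necessarily the one used in this report: the dictionary maps each term to the number of documents containing it, and the postings map each term to a list of (document ID, within-document frequency) entries. The names STEMMED_DOCS and build_index are illustrative.

```python
from collections import Counter, defaultdict

# Stemmed documents from the report, lowercased (an assumption).
STEMMED_DOCS = {
    1: "googl rank system sort hundr billion webpag search index us relev "
       "result fraction second",
    2: "databas secur concern broad rang inform secur control protect databas",
    3: "data breach rise make inform secur privaci top prioriti busi it leader",
}

def build_index(docs):
    """Return (dictionary, postings) for a {doc_id: text} mapping.

    dictionary: term -> number of documents containing the term
    postings:   term -> list of (doc_id, within-document frequency)
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        term_counts = Counter(docs[doc_id].split())
        for term, freq in sorted(term_counts.items()):
            postings[term].append((doc_id, freq))
    dictionary = {term: len(entries) for term, entries in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_index(STEMMED_DOCS)
for term in ("databas", "inform", "secur"):
    print(term, dictionary[term], postings[term])
```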

• Testing
Testing the posting file involves submitting words from it to a search engine and checking whether the
returned results are related to the documents that make up the posting file. In my tests, the results
returned had some similarities with the documents.
(analyse ∧ automate ∧ ¬forum)
Boolean and vector queries
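a. Boolean queries
Query terms are stemmed in the same way as the documents (secure becomes secur, database becomes
databas) before they are looked up.
1) (inform ∧ database ∧ ¬index)
Results: Doc2
2) (secure ∨ database)
Results: Doc2 and Doc3
3) ((inform ∨ database) ∧ secure)
Results: Doc2 and Doc3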
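Boolean queries map directly onto set operations over the posting lists: AND is intersection, OR is union and NOT is the complement with respect to all documents. The sketch below evaluates the three queries against posting sets derived from the stemmed documents in this report, reading the negated term in query 1 as index (an assumption). The helper names are illustrative.

```python
ALL_DOCS = {1, 2, 3}

# Document sets for the stemmed query terms, read off the stemmed documents.
POSTINGS = {
    "inform": {2, 3},
    "secur": {2, 3},
    "databas": {2},
    "index": {1},
}

def docs_with(term):
    """Documents containing the term; empty set if the term is not indexed."""
    return POSTINGS.get(term, set())

def negate(doc_set):
    """Documents that are NOT in doc_set."""
    return ALL_DOCS - doc_set

q1 = docs_with("inform") & docs_with("databas") & negate(docs_with("index"))
q2 = docs_with("secur") | docs_with("databas")
q3 = (docs_with("inform") | docs_with("databas")) & docs_with("secur")

print(sorted(q1), sorted(q2), sorted(q3))
```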
b. Vector queries
The query used to compute the cosine similarity with each of the three documents is:
Query Q = (Inform, Secure, Database)
Each document is represented by the frequencies of the three query terms, in that order.
Document 1
D1 = <0, 0, 0>
Q = <1, 1, 1>
Document 1 contains none of the query terms, so D1 is the zero vector. The cosine is undefined for a
zero vector (the denominator is 0), so the similarity is taken as σ(D1, Q) = 0.
Document 2
D2 = <1, 2, 2>
Q = <1, 1, 1>
$$\sigma(D_2, Q) = \frac{1 \times 1 + 2 \times 1 + 2 \times 1}{\sqrt{1^2 + 2^2 + 2^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{5}{\sqrt{9}\,\sqrt{3}} = 0.962$$
Document 3
D3 = <1, 1, 0>
Q = <1, 1, 1>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} = 0.816$$
According to these results, Doc 2 has the highest cosine similarity with the query (Vanderbush, 2017),
followed by Doc 3, while Doc 1 shares no terms with the query and scores 0. This means that Doc 2 will
appear first when the query is passed to a search engine, followed by Doc 3. When two documents have
the same cosine similarity, adding another word to the query can determine which of them appears
first, depending on which document contains that word. Vector queries have an advantage over Boolean
queries because they also give the order in which the documents will appear in a search engine.
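The cosine similarity calculation can be checked with a short function. This is a minimal sketch; returning 0 when either vector is all zeros is a convention assumed here, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between term-frequency vectors d and q."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: a zero vector scores 0
    return dot / (norm_d * norm_q)

q = (1, 1, 1)    # query (inform, secure, database)
d2 = (1, 2, 2)   # Document 2 term frequencies
print(round(cosine_similarity(d2, q), 3))  # 0.962
```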
Question 2 IR evaluation
Search engines
My chosen search engines are:
• Google by Google LLC
• Bing by Microsoft
Targets
My target is:
Target 4: obtain the unit guide of SIT774.
Search queries
To test the target above, two queries are designed and used to measure recall and precision. The
queries are:
• Query 1 = SIT774 unit guide
• Query 2 = SIT774 manual

Google search Engine graph
Figure 1: Google Search Engine

Bing Search Engine graph
Figure 2: Bing search engine results for query 1 and query 2
Average Google and Bing graph

Figure 3: Average comparison for the two queries for Google and Bing search engines
According to the graph of the averages over the two queries for the Google and Bing search engines,
Google is the superior search engine: averaged over the two queries, it has both higher recall and
higher precision. Recall is the proportion of all relevant results that the search engine returns, and
precision is the proportion of the returned results that are actually relevant to the query
(Rather, 2005). Google is therefore superior to Bing because it retrieves a larger share of the
relevant results and a larger share of what it returns is relevant.
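Precision and recall for a single query can be computed as in the sketch below. The result identifiers are hypothetical and are not measurements from this report; they only illustrate the two ratios.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgements, for illustration only.
relevant = {"r1", "r2", "r3", "r4"}
retrieved = ["r1", "r2", "x1", "x2", "r3"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Averaging these two values over both queries for each search engine gives the figures that the comparison graph plots.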
References
Brasetvik, A. (2013). Elasticsearch from the Bottom Up, Part 1. [online] Elastic. Available at:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up [Accessed 21 Sep. 2018].
Rather, R. (2005). Precision and Recall of Five Search Engines for Retrieval of Scholarly Information in the
Field of Biotechnology. [online] Webology.org. Available at:
http://www.webology.org/2005/v2n2/a12.html [Accessed 21 Sep. 2018].
Vanderbush, A. (2017). Algorithmic Stemming in Elasticsearch. [online] Qbox.io. Available at:
https://qbox.io/blog/elasticsearch-algorithmic-stemming-tutorial [Accessed 21 Sep. 2018].