SIT772 - Database and Information Retrieval: Assignment 2


Added on  2023/03/21

Homework Assignment
AI Summary
This document presents a comprehensive solution to a data science assignment focusing on information retrieval techniques. The assignment begins with stemming and stop word removal applied to three documents, followed by the creation of a merged inverted list with within-document frequencies and a dictionary. The core of the solution explores the Boolean and vector models for query processing. The Boolean model utilizes logical operators to retrieve documents, while the vector model employs cosine similarity to rank documents based on relevance. The document then evaluates the performance of Google and Bing search engines by analyzing their results for queries related to the price of a new Xbox One. The analysis includes identifying relevant and irrelevant documents returned by each search engine, highlighting the differences in precision between the two engines. The assignment demonstrates the practical application of information retrieval concepts and provides a comparative analysis of different search engine approaches.
COVER PAGE
Question 1
1a) Results of removing the stop words
Document 1
Data science interdisciplinary field scientific methods, processes, algorithms systems extract
knowledge insights data various forms, structured unstructured.
Document 2
Data mining process discovering patterns large data sets involving methods intersection
machine learning, statistics, database systems
Document 3
Information systems study complementary networks hardware software people organizations
collect, filter, process, create, distribute data
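The stop-word removal above can be sketched in a few lines. This is only an illustration: the stop-word list `STOP_WORDS` and the sample sentence are assumptions, since the assignment's actual list and the original document text are not shown here.

```python
import re

# Assumed stop-word list: a tiny illustrative set, not the assignment's list.
STOP_WORDS = {"is", "the", "of", "in", "at", "and", "a", "an", "to"}

def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical input sentence, used only to exercise the function.
sample = "Data mining is the process of discovering patterns in large data sets"
print(remove_stop_words(sample))
```

Punctuation is discarded by the tokenizer, so commas that survive in the listings above would not appear in the token list.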
Rule format: (condition) S1 -> S2
Step 1a
SSES -> SS
Processes -> process
IES -> I
SS -> SS
Process -> process
S ->
Algorithms -> algorithm
Systems -> system
Insights -> insight
Forms -> form
Patterns -> pattern
Sets -> set
Methods -> method
Statistics -> statistic
Networks -> network
Organizations -> organization
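The four Step 1a rules can be expressed as a short function. This is a minimal sketch of Step 1a only, assuming the rules are tried in the order listed so that the longest matching suffix wins; the function name `step_1a` is chosen here for illustration.

```python
def step_1a(word: str) -> str:
    """Apply Porter Step 1a; the first matching suffix rule is used."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]   # SSES -> SS
    if w.endswith("ies"):
        return w[:-2]   # IES -> I
    if w.endswith("ss"):
        return w        # SS -> SS (unchanged)
    if w.endswith("s"):
        return w[:-1]   # S -> (removed)
    return w

for w in ["Processes", "Process", "Algorithms", "Statistics", "Networks"]:
    print(w, "->", step_1a(w))
```

Checking SS before S is what keeps "process" intact instead of trimming it to "proces".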
Step 1b
(*v*) ED ->
Structured -> structur
Unstructured -> unstructur
(*v*) ING ->
Discovering -> discover
Involving -> involv
Learning -> learn
Mining -> min
(m=1 and *o) -> E
Min -> mine (Mining with ING removed; this rule restores the final E)
Step 1c
(*v*) Y -> I
Interdisciplinary -> interdisciplinari
Complementary -> complementari
Study -> studi
Step 2
(m>0) ATION -> ATE
Information -> informate
(m>0) IZATION -> IZE
Organization -> organize
(m>0) IVITI -> IVE
Activiti -> active
Step 3
Step 4
(m>1) AL ->
(m>1) ATE ->
Informate -> inform
(m>1 and (*S or *T)) ION ->
Intersection -> intersect
(m>1) IZE ->
(m>1) ANT ->
Step 5a
(m>1) E ->
Knowledge -> knowledg
Large -> larg
Machine -> machin
Hardware -> hardwar
Software -> softwar
Distribute -> distribut
(m=1 and not *o) E ->
Create -> creat
People -> peopl
Large -> larg
Searche -> search
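The conditions in these rules depend on the measure m (the number of vowel-consonant sequences in the stem) and on *o (the stem ends consonant-vowel-consonant, where the last consonant is not w, x or y). The sketch below shows one way to compute both and apply Step 5a; the helper names are illustrative, and the code follows the standard Porter definitions rather than any code from the assignment.

```python
def is_consonant(word: str, i: int) -> bool:
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Return m, the number of VC sequences in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1          # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m

def ends_cvc(stem: str) -> bool:
    """Condition *o: ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step_5a(word: str) -> str:
    """Drop a final e if m > 1, or if m = 1 and the stem does not satisfy *o."""
    w = word.lower()
    if w.endswith("e"):
        stem = w[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return w

for w in ["knowledge", "machine", "create", "people", "mine"]:
    print(w, "->", step_5a(w))
```

In this sketch "mine" keeps its final e because its stem "min" has m = 1 and ends consonant-vowel-consonant, which matches the form used in the stemmed Document 2.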
Stemmed documents
Document 1
Data scienc interdisciplinari field scientif method process algorithm system extract knowledg
insight data variou form structur unstructur
Document 2
Data mine process discov pattern larg data set involv method intersect machin learn statist
databas system
Document 3
Informat system studi complementari network hardwar softwar peopl organ collect filter
process creat distribut data
1b) Merged inverted list including within-document frequencies
Merged sorted list with within-document frequency
Term               Document  Frequency
algorithm          1         1
collect            3         1
complementari      3         1
creat              3         1
data               1         2
data               2         2
data               3         1
databas            2         1
discov             2         1
distribut          3         1
extract            1         1
field              1         1
filter             3         1
form               1         1
hardwar            3         1
informat           3         1
insight            1         1
interdisciplinari  1         1
intersect          2         1
involv             2         1
knowledg           1         1
larg               2         1
learn              2         1
machin             2         1
method             1         1
method             2         1
mine               2         1
network            3         1
organ              3         1
pattern            2         1
peopl              3         1
process            1         1
process            2         1
process            3         1
scienc             1         1
scientif           1         1
set                2         1
softwar            3         1
statist            2         1
structur           1         1
studi              3         1
system             1         1
system             2         1
system             3         1
unstructur         1         1
variou             1         1
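A list like the one above can be generated mechanically from the stemmed documents. The following is a minimal sketch, assuming the three stemmed documents are lowercased and split on whitespace; `build_postings` is an illustrative name.

```python
from collections import Counter

# Stemmed documents copied from the "Stemmed documents" listing, lowercased.
DOCS = {
    1: ("data scienc interdisciplinari field scientif method process algorithm "
        "system extract knowledg insight data variou form structur unstructur"),
    2: ("data mine process discov pattern larg data set involv method intersect "
        "machin learn statist databas system"),
    3: ("informat system studi complementari network hardwar softwar peopl organ "
        "collect filter process creat distribut data"),
}

def build_postings(docs: dict[int, str]) -> list[tuple[str, int, int]]:
    """Return (term, document, within-document frequency) rows sorted by term, then document."""
    rows = []
    for doc_id, text in docs.items():
        for term, freq in Counter(text.split()).items():
            rows.append((term, doc_id, freq))
    return sorted(rows)

postings = build_postings(DOCS)
for term, doc_id, freq in postings:
    print(f"{term:<18} {doc_id}  {freq}")
```

Sorting the tuples orders rows by term first and document number second, which is the merged order used in the list.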
1c) Dictionary and related posting file
1d) Testing the inverted index
Keywords: data, system and algorithm
Data
Inverted Index (Data)
System
Inverted index (System)
Algorithm
In the dictionary file, the three keywords appear in alphabetical order. The first, algorithm,
occurs once, in Document 1 only. The second, data, occurs in all three documents: twice in
Document 1, twice in Document 2 and once in Document 3. The last, system, occurs once in
each of the three documents.
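The lookups described above can be reproduced with a small dictionary-based index. This is a sketch only: the `INDEX` structure is an assumed in-memory layout, with postings for the three keywords copied from the merged list in 1b.

```python
# term -> {document number: within-document frequency}, copied from the 1b list
INDEX = {
    "algorithm": {1: 1},
    "data": {1: 2, 2: 2, 3: 1},
    "system": {1: 1, 2: 1, 3: 1},
}

def lookup(term: str) -> tuple[int, dict[int, int]]:
    """Return (document frequency, postings) for a term, or (0, {}) if it is absent."""
    postings = INDEX.get(term.lower(), {})
    return len(postings), postings

for keyword in ["Data", "System", "Algorithm"]:
    df, postings = lookup(keyword)
    print(keyword.lower(), "document frequency:", df, "postings:", postings)
```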
Boolean model and vector model
a. Boolean Model queries
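1) method ∧ process ∧ system
In the inverted list in 1b, method occurs only in Documents 1 and 2, while process and system
occur in all three documents, so this query returns Documents 1 and 2.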
2) database ∧ algorithm
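With AND, this query returns no documents: databas occurs only in Document 2 and algorithm
only in Document 1. Documents 1 and 2 are returned only if the two terms are combined with OR.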
3) method ∨ process ∨ system
This query returns all documents
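Boolean queries of this kind reduce to set operations on posting lists. The sketch below assumes the postings from the merged list in 1b (stemmed terms, document numbers only); the helper names `boolean_and` and `boolean_or` are illustrative.

```python
# Posting lists (sets of document numbers) copied from the merged list in 1b.
POSTINGS = {
    "method": {1, 2},
    "process": {1, 2, 3},
    "system": {1, 2, 3},
    "databas": {2},
    "algorithm": {1},
}

def boolean_and(*terms: str) -> set[int]:
    """Documents containing every term: intersection of the posting lists."""
    lists = [POSTINGS.get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()

def boolean_or(*terms: str) -> set[int]:
    """Documents containing at least one term: union of the posting lists."""
    return set().union(*(POSTINGS.get(t, set()) for t in terms))

print("method AND process AND system:", sorted(boolean_and("method", "process", "system")))
print("databas AND algorithm:", sorted(boolean_and("databas", "algorithm")))
print("method OR process OR system:", sorted(boolean_or("method", "process", "system")))
```

Query terms must be stemmed the same way as the index, which is why "database" is looked up as "databas".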
b. Vector model using cosine similarity
The similarity threshold is:
Q = (data, data, system, algorithm), with vector components ordered as <data, system, algorithm>
Document 1
D1= <2, 1, 1>
Q= <2, 1, 1>
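The cosine similarity between the query and a document vector is the dot product divided by the product of the vector lengths. The sketch below computes it for the vectors given above; the vectors for Documents 2 and 3 are an assumption derived from the frequencies in the 1b list (component order data, system, algorithm) and are not taken from this part of the text.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

Q = [2, 1, 1]            # query vector from the text
DOC_VECTORS = {
    1: [2, 1, 1],        # D1 from the text
    2: [2, 1, 0],        # assumed from the 1b list: data x2, system x1, no algorithm
    3: [1, 1, 0],        # assumed from the 1b list: data x1, system x1, no algorithm
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(DOC_VECTORS, key=lambda d: cosine(Q, DOC_VECTORS[d]), reverse=True)
for doc_id in ranking:
    print(f"Document {doc_id}: {cosine(Q, DOC_VECTORS[doc_id]):.4f}")
```

Because D1 and Q are identical vectors, their similarity is 1 up to floating-point rounding, so Document 1 ranks first.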