Creating and Testing Inverted Index with Boolean and Vector Models

Added on 2023/04/23

AI Summary

This assignment provides a comprehensive overview of inverted indexing techniques for information retrieval. It begins by detailing the process of creating an inverted index from a set of unstructured documents, including stop word removal and the application of Porter's stemming algorithm. The assignment then explains the creation of a merged inverted list and a posting file, demonstrating how to test the inverted index using keywords. Furthermore, it explores the Boolean and Vector models, including calculating cosine similarity for document retrieval. The report also includes a comparative analysis of the Boolean and Vector models, highlighting their differences in retrieving and ranking documents. Finally, the assignment evaluates search engine performance using precision and recall metrics, comparing the results of Ask.com and Google for specific queries, ultimately determining that Google provides better results based on the higher precision and recall values.

COVER PAGE

Q 1
Below is a list of documents in unstructured format that will be used to apply an index technique to
convert them into an inverted index.
Doc 1：Information retrieval is the activity of obtaining information resources relevant to an information
need from a collection of information resources. Searches can be based on full-text or other content-
based indexing.
Doc 2：Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
Doc 3：Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The following steps are followed to create an inverted index.
a. Stop word removal and Porter's stemming algorithm
Stop word removal
Stop word removal eliminates every term that is classified as a stop word from all three
documents. This process results in the following documents:
Document 1: Information retrieval activity obtaining information resources relevant information
collection information resources Searches based full-text content-based indexing
Document 2: Information retrieval finding material unstructured nature satisfies information
within large collections
Document 3: Information systems study complementary networks hardware software people
organizations collect filter process create distribute data
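The stop word removal step can be sketched in Python as follows. This is a minimal sketch, not the procedure actually used for the report: the `STOP_WORDS` set is a hypothetical list chosen so that the three sample documents reduce to the term lists shown above, and the tokenizer (which keeps hyphenated words such as full-text together) is likewise an assumption.

```python
import re

# Hypothetical stop list, chosen only to reproduce the reduced documents
# shown in this report. Real stop lists are much longer.
STOP_WORDS = {
    "is", "the", "of", "to", "an", "a", "from", "can", "be",
    "on", "or", "other", "that", "and", "use", "need",
}

def remove_stop_words(text):
    """Return the words of `text` in order, with stop words dropped."""
    # A token is a run of letters, optionally joined by hyphens.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

DOC_2 = (
    "Information retrieval is finding material of an unstructured nature "
    "that satisfies an information need from within large collections."
)

print(" ".join(remove_stop_words(DOC_2)))
```

The comparison is done on the lowercased token so that a capitalised stop word at the start of a sentence is also removed, while the surviving terms keep their original spelling.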
Porter's stemming algorithm
This algorithm removes suffixes from the terms that make up each document, which is very useful
in information retrieval. In most cases, terms with a common stem have the same meaning.
Consider, for example, the following terms:
Connections
Connected
Connection
Connect
Connecting
In information retrieval, performance improves when terms like these are conflated into a single
term. Conflation is achieved by removing the suffixes from the words, which here leaves only one
term: connect. Stemming reduces the number of distinct terms in a document, which in turn
reduces the size and complexity of the data and so improves performance. The Porter algorithm
was designed on the assumption that no stem dictionary is available and that the goal of the
task is to improve information retrieval performance.
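The idea of conflating terms by suffix removal can be illustrated with a toy suffix stripper. This is deliberately not Porter's algorithm, which applies several ordered rule phases with conditions on the stem; the `SUFFIXES` tuple and the `min_stem` threshold below are assumptions chosen only to handle the connect example.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration of
# conflation. Suffixes are tried in order, longest first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

terms = ["Connections", "Connected", "Connection", "Connect", "Connecting"]
stems = {strip_suffix(t) for t in terms}
print(stems)
```

All five terms collapse to a single index term, which is the effect the stemming step relies on.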

Applying the stemming algorithm to the documents obtained after stop word removal results in
the following documents:
Document 1: Inform retriev activ obtain inform resourc relev inform collect inform resourc
Search base full text content base index
Document 2: Inform retriev find materi unstructur natur satisfi inform within larg collect
Document 3: Inform system studi complementari network hardwar softwar peopl organ
collect filter process creat distribut data
b. Merged inverted list
To create the merged inverted list, the following steps are followed:
1. Take the final documents obtained after removing stop words and applying Porter's
stemming algorithm, then create a table showing each term and the document that contains it.
2. Sort the table from step 1 in ascending order by term.
3. Create a merged list showing the within-document frequency of each term, as shown in the
table below.
A convenient tool for these steps is Microsoft Excel, since it automates most of the actions, for
example sorting the terms in ascending order.
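The three steps can also be sketched in Python instead of a spreadsheet. This is a minimal sketch under stated assumptions: `STEMMED_DOCS` holds the stemmed documents in lowercase, and every occurrence of "information" is assumed to stem to "inform". The function name `merged_inverted_list` is hypothetical.

```python
from collections import Counter

# Stemmed documents from the previous step (lowercased; "inform" is the
# assumed stem of every occurrence of "information").
STEMMED_DOCS = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform within "
       "larg collect",
    3: "inform system studi complementari network hardwar softwar peopl "
       "organ collect filter process creat distribut data",
}

def merged_inverted_list(docs):
    """Map each term to {doc_id: within-document frequency}, sorted by term."""
    # Step 1: one (term, doc_id) pair per term occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in text.split()]
    # Step 2: sort by term, then by document.
    pairs.sort()
    # Step 3: merge duplicate pairs into within-document frequencies.
    merged = {}
    for (term, doc_id), freq in Counter(pairs).items():
        merged.setdefault(term, {})[doc_id] = freq
    return merged

index = merged_inverted_list(STEMMED_DOCS)
for term in ("inform", "base", "collect"):
    print(term, index[term])
```

Because the pairs are sorted before counting, the resulting dictionary lists the terms in ascending order, mirroring step 2.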

c. Posting file

Term            Frequency   Posting
Activ           1           1
Base            2           1
Collect         1           2
Complementari   1           3
Content         1           3
Data            1           3
Distribut       1           3
Filter          1           3
Find            1           2
Full            1           1
Hardwar         1           3
Index           1           1
Inform          5           1
Larg            1           2
Materi          1           2
Natur           1           2
Network         1           3
Obtain          1           1
Organ           1           3
Peopl           1           3
Process         1           3
Relev           1           1
Resourc         1           2
Retriev         1           1
Satisfi         1           2
Search          1           1
Softwar         1           3
Studi           1           3
System          1           3
d. Testing the inverted index using keywords: information, system, index
Testing the posting file with the keywords information, system and index through a search engine
should return documents related to the terms in the posting file (Beiske, 2017). When the test
is run, most of the results returned by a search engine such as Google are documents related to
information systems.
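The same test can be run directly against the three sample documents by looking up each keyword's stem in a posting list. The sketch below makes two assumptions: the stemmed documents are reproduced in lowercase with "inform" as the stem of "information", and `KEYWORD_STEMS` is a hypothetical hand-written mapping standing in for a real stemmer.

```python
from collections import defaultdict

STEMMED_DOCS = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform within "
       "larg collect",
    3: "inform system studi complementari network hardwar softwar peopl "
       "organ collect filter process creat distribut data",
}

# Hypothetical keyword-to-stem mapping (stands in for a real stemmer).
KEYWORD_STEMS = {"information": "inform", "system": "system", "index": "index"}

def build_postings(docs):
    """Map each stemmed term to the set of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return postings

def lookup(keyword, postings):
    """Return the sorted document IDs whose text contains the keyword's stem."""
    stem = KEYWORD_STEMS[keyword.lower()]
    return sorted(postings.get(stem, set()))

postings = build_postings(STEMMED_DOCS)
for keyword in ("information", "system", "index"):
    print(keyword, "->", lookup(keyword, postings))
```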
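e. Boolean Model
i. Retrieve AND Search
Results = Doc1 (retriev occurs in Doc1 and Doc2, but search occurs only in Doc1)
ii. Material OR Nature
Results = Doc2
iii. Information AND Retrieve
Results = Doc1 & Doc2 (inform occurs in all three documents, retriev only in Doc1 and Doc2)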
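Boolean retrieval reduces to set operations on posting lists: AND is an intersection and OR is a union. The sketch below assumes the posting sets that follow from the stemmed documents in part a; `POSTINGS` lists only the terms needed for the three queries, and the helper names are hypothetical.

```python
# Posting sets (stemmed term -> document IDs), read off the stemmed
# documents in part a. Only the terms used by the three queries are listed.
POSTINGS = {
    "retriev": {1, 2},
    "search": {1},
    "materi": {2},
    "natur": {2},
    "inform": {1, 2, 3},
}

def boolean_and(term_a, term_b):
    """Documents that contain both terms."""
    return POSTINGS.get(term_a, set()) & POSTINGS.get(term_b, set())

def boolean_or(term_a, term_b):
    """Documents that contain at least one of the terms."""
    return POSTINGS.get(term_a, set()) | POSTINGS.get(term_b, set())

print("retriev AND search:", sorted(boolean_and("retriev", "search")))
print("materi OR natur:   ", sorted(boolean_or("materi", "natur")))
print("inform AND retriev:", sorted(boolean_and("inform", "retriev")))
```

The result of each query is an unordered set, which is why the Boolean model says nothing about ranking.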
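Posting file (continued)
Term            Frequency   Posting
Text            1           1
Unstructur      1           2
Within          1           2

f. Vector model using cosine similarity
The query is Q = (information, system, index), so Q = <1, 1, 1>. Each document vector holds the
within-document frequencies of the stemmed terms (inform, system, index), and the similarity
between a document and the query is

$$\sigma(D_i, Q) = \frac{D_i \cdot Q}{\lVert D_i \rVert \, \lVert Q \rVert}$$

Doc 1
D1 = <4, 0, 1>
Q = <1, 1, 1>
$$\sigma(D_1, Q) = \frac{4 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{4^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{5}{\sqrt{17}\,\sqrt{3}} \approx 0.70$$

Doc 2
D2 = <2, 0, 0>
Q = <1, 1, 1>
$$\sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58$$

Doc 3
D3 = <1, 1, 0>
Q = <1, 1, 1>
$$\sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82$$

Ranked by cosine similarity, the documents are retrieved in the order Doc 3, Doc 1, Doc 2.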
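The cosine calculation can be checked with a few lines of Python. This is a minimal sketch: the document vectors are an assumption of the example, taken as the counts of the stems (inform, system, index) in the stemmed documents of part a, and the function name `cosine_similarity` is hypothetical.

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0

# Term order: (inform, system, index). Counts are assumed from the
# stemmed documents in part a.
QUERY = [1, 1, 1]
DOC_VECTORS = {1: [4, 0, 1], 2: [2, 0, 0], 3: [1, 1, 0]}

scores = {doc_id: cosine_similarity(vec, QUERY)
          for doc_id, vec in DOC_VECTORS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(f"Doc {doc_id}: {scores[doc_id]:.2f}")
```

For non-negative term frequencies the cosine always lies between 0 and 1, which gives a quick sanity check on hand calculations.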
Boolean queries and vector model comparison
The difference between the two is that a Boolean query identifies which documents should be
returned for a query but gives no order among them, whereas the vector model both identifies
the documents and ranks them. Because the vector model calculates the cosine similarity between
each document and the query, the value obtained for each document determines the order in
which the documents are retrieved.
Question 2 IR evaluation
a. Target and designed queries
Search engines
- Ask.com search engine
- Google search engine
Selected target
Target 3: Obtain the manual for installing Tera Term
Queries
- Q1 = Tera-term installation manual
- Q2 = Guide for tera-term installation
b. List your target, results and designed search queries
Google search engine
Key: green = precision, white = recall
Figure 1: Ask.com search engine
Key: green = precision, white = recall
Figure 2: Average comparison
Key: green = precision, white = recall

According to the average comparison of Ask.com and Google for the two queries, Google is better
than Ask.com because it has both higher precision and higher recall. For both queries, Google
retrieved more documents relevant to the search query than Ask.com did.
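Precision and recall for a single query can be computed as below. This is a minimal sketch: the result identifiers and relevance judgements are hypothetical placeholders, not the values measured for Ask.com or Google in this report.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: five results returned, four documents judged relevant
# overall, two of which appear among the results.
retrieved = ["r1", "r2", "r3", "r4", "r5"]
relevant = ["r1", "r3", "r6", "r7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Averaging these two values over the queries for each engine gives the kind of per-engine comparison described above.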