Applying Index Technique to Convert Unstructured Documents into Inverted Index
Added on 2023/04/23
AI Summary
Learn how to apply an indexing technique to convert unstructured documents into an inverted index. Understand stop word removal, Porter's stemming algorithm, the merged inverted list, the posting file, and the Boolean model. Get insights into IR evaluation and the precision-recall plot.
COVER PAGE
Q 1
Below is a list of documents in unstructured format that will be used to apply an index technique to
convert them into an inverted index.
Doc 1: Information retrieval is the activity of obtaining information resources relevant to an information
need from a collection of information resources. Searches can be based on full-text or other
content-based indexing.
Doc 2: Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
Doc 3: Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The following steps are followed to create an inverted index.
a. Stop word removal and Porter's stemming algorithm
Stop word removal
Stop word removal eliminates every term classified as a stop word from all three documents.
This process results in the following documents:
Document 1: Information retrieval activity obtaining information resources relevant information
collection information resources Searches based full-text content-based indexing
Document 2: Information retrieval finding material unstructured nature satisfies information
within large collections
Document 3: Information systems study complementary networks hardware software people
organizations collect filter process create distribute data
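The stop word step can be sketched in a few lines of Python. The assignment does not say which stop list was used, so the `STOP_WORDS` set below is an assumption, chosen only so that the sketch reproduces the output shown above; the function name `remove_stop_words` is likewise illustrative.

```python
import re

# Hypothetical stop list: the source does not name the list it used.
# These words were chosen so the result matches the documents shown above.
STOP_WORDS = {
    "a", "an", "and", "be", "can", "from", "is", "need",
    "of", "on", "or", "other", "that", "the", "to", "use",
}

def remove_stop_words(text):
    """Return the words of `text`, in order, with stop words dropped."""
    # Keep hyphenated words such as "full-text" as single tokens.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

doc2 = ("Information retrieval is finding material of an unstructured "
        "nature that satisfies an information need from within large "
        "collections.")
print(" ".join(remove_stop_words(doc2)))
```

A different stop list (for example, one that keeps "need" or drops "within") would give a different result, so the list should be stated alongside any index built from it.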
Porter's stemming algorithm
This algorithm removes suffixes from the terms that make up a document, which is very useful in
information retrieval. In most cases, terms with a common stem have similar meanings. Consider
the following terms:
Connections
Connected
Connection
Connect
Connecting
Retrieval performance generally improves when terms like these are conflated into a single term.
Removing the suffixes from the words above leaves one term, connect. Stemming reduces the
number of distinct terms in a document, which in turn reduces the size and complexity of the
data and so improves performance. The Porter algorithm was designed on the assumption that
no stem dictionary is available and that the goal is to improve information retrieval performance.
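The idea of conflating terms by suffix removal can be illustrated with a deliberately small sketch. This is not the Porter algorithm, which applies several ordered rule steps with conditions on the remaining stem; the function `toy_stem` and its three suffix rules are assumptions made only to show the connect example.

```python
def toy_stem(word):
    """Strip a few common suffixes. A toy illustration, not the Porter algorithm."""
    word = word.lower()
    # Drop a plural "s" first (but leave words ending in "ss" alone).
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Then drop one suffix, provided at least three letters of stem remain.
    for suffix in ("ing", "ion", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Connections", "Connected", "Connection", "Connect", "Connecting"]
stems = {toy_stem(w) for w in words}
print(stems)  # all five words collapse to a single stem
```

Rules this crude over-stem many ordinary words, which is why a full implementation of Porter's rules would be used in practice.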
Applying the stemming algorithm to the documents obtained after stop word removal results in
the following documents:
Document 1: Inform retriev activ obtain inform resourc relev inform collect inform resourc
Search base full text content base index
Document 2: Inform retriev find materi unstructur natur satisfi inform within larg collect
Document 3: Inform system studi complementari network hardwar softwar peopl organ
collect filter process creat distribut data
b. Merged inverted list
To create the merged inverted list, the following steps are followed:
1. Take the final documents obtained after stop word removal and Porter's stemming, and create
a table listing each term together with the document that contains it.
2. Sort the table from step 1 in ascending order by term.
3. Merge the entries into a list showing the within-document frequency of each term, as
shown in the table below.
Microsoft Excel is a convenient tool for these steps because it automates most of the actions,
for example sorting the terms in ascending order.
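The three steps above can also be sketched in Python instead of a spreadsheet. The sketch assumes the stemmed documents from part (a), lower-cased, with "information" stemmed to "inform"; the names `docs` and `build_inverted_index` are illustrative.

```python
from collections import Counter, defaultdict

# Stemmed documents from part (a), lower-cased.
docs = {
    1: "inform retriev activ obtain inform resourc relev inform collect "
       "inform resourc search base full text content base index",
    2: "inform retriev find materi unstructur natur satisfi inform "
       "within larg collect",
    3: "inform system studi complementari network hardwar softwar peopl "
       "organ collect filter process creat distribut data",
}

def build_inverted_index(documents):
    """Map each term to {document number: within-document frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Step 1 and step 3: pair each term with its document and count it.
        for term, freq in Counter(text.split()).items():
            index[term][doc_id] = freq
    # Step 2: order the entries by term.
    return dict(sorted(index.items()))

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Each printed line is one merged entry: the term followed by its postings, where every posting pairs a document number with the term's frequency in that document.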
c. Posting file
Each row lists a term, its within-document frequency, and the document (posting) that contains it.

Term            Frequency   Posting
Activ           1           1
Base            2           1
Collect         1           1
Collect         1           2
Collect         1           3
Complementari   1           3
Content         1           1
Creat           1           3
Data            1           3
Distribut       1           3
Filter          1           3
Find            1           2
Full            1           1
Hardwar         1           3
Index           1           1
Inform          4           1
Inform          2           2
Inform          1           3
Larg            1           2
Materi          1           2
Natur           1           2
Network         1           3
Obtain          1           1
Organ           1           3
Peopl           1           3
Process         1           3
Relev           1           1
Resourc         2           1
Retriev         1           1
Retriev         1           2
Satisfi         1           2
Search          1           1
Softwar         1           3
Studi           1           3
System          1           3
Text            1           1
Unstructur      1           2
Within          1           2

d. Testing the inverted index using keywords: information, system, index
Looked up in the posting file, the stemmed keywords match as follows: inform occurs in Doc 1,
Doc 2 and Doc 3, system occurs in Doc 3, and index occurs in Doc 1. Submitting the same
keywords to a search engine should return documents that are related to the posting file
(Beiske, 2017). When this is tested, most of the results returned by a search engine such as
Google are documents related to information systems.
e. Boolean Model
i. Retrieve AND Search
Results = Doc 1 (retriev occurs in Doc 1 and Doc 2, search only in Doc 1)
ii. Material OR Nature
Results = Doc 2
iii. Information AND Retrieve
Results = Doc 1 & Doc 2 (inform occurs in all three documents, retriev only in Doc 1 and Doc 2)
f. Vector model using cosine similarity
Q = (Information, system, index)
Each document vector holds the within-document frequencies of the three query terms, taken
from the posting file.
Doc 1
D1 = <4, 0, 1>
Q = <1, 1, 1>
\[ \sigma(D_1, Q) = \frac{4 \times 1 + 0 \times 1 + 1 \times 1}{\sqrt{4^2 + 0^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{5}{\sqrt{17}\,\sqrt{3}} \approx 0.70 \]
Doc 2
D2 = <2, 0, 0>
Q = <1, 1, 1>
\[ \sigma(D_2, Q) = \frac{2 \times 1 + 0 \times 1 + 0 \times 1}{\sqrt{2^2 + 0^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{4}\,\sqrt{3}} \approx 0.58 \]
Doc 3
D3 = <1, 1, 0>
Q = <1, 1, 1>
\[ \sigma(D_3, Q) = \frac{1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{1^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 \]
Ranked by similarity to the query, the documents are retrieved in the order Doc 3, Doc 1, Doc 2.
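The cosine calculation can be checked with a short script. This is a sketch under stated assumptions: the document vectors are raw term counts taken from the stemmed documents of Question 1 (with "information" stemmed to "inform"), the query weights are all 1, and the names `cosine_similarity` and `stemmed_docs` are illustrative.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)

# Stemmed documents from Question 1, lower-cased.
stemmed_docs = {
    "Doc 1": "inform retriev activ obtain inform resourc relev inform collect "
             "inform resourc search base full text content base index",
    "Doc 2": "inform retriev find materi unstructur natur satisfi inform "
             "within larg collect",
    "Doc 3": "inform system studi complementari network hardwar softwar peopl "
             "organ collect filter process creat distribut data",
}
query_terms = ["inform", "system", "index"]  # stemmed query keywords
query = [1, 1, 1]

scores = {}
for name, text in stemmed_docs.items():
    tokens = text.split()
    vector = [tokens.count(term) for term in query_terms]
    scores[name] = cosine_similarity(vector, query)

ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(name, round(scores[name], 3))
```

Because every component here is non-negative, each score falls between 0 and 1, which gives a quick sanity check on a hand calculation.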
Boolean queries and vector model comparison
A Boolean query identifies which documents should be returned for a query but gives no order
in which to return them. The vector model also identifies the documents to return and, in
addition, ranks them: it calculates the cosine similarity of each document to the query, and the
resulting values determine the order in which the documents are retrieved.
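The Boolean side of this comparison can be sketched with set operations on postings. The `postings` dictionary is an assumption read off the three stemmed documents of Question 1, and the helper names are illustrative.

```python
# Postings (sets of document numbers) for a few stemmed terms,
# read off the three stemmed documents in Question 1.
postings = {
    "inform": {1, 2, 3},
    "retriev": {1, 2},
    "search": {1},
    "materi": {2},
    "natur": {2},
}

def boolean_and(a, b):
    """Documents containing both terms."""
    return postings.get(a, set()) & postings.get(b, set())

def boolean_or(a, b):
    """Documents containing at least one of the terms."""
    return postings.get(a, set()) | postings.get(b, set())

print(boolean_and("retriev", "search"))
print(boolean_or("materi", "natur"))
print(boolean_and("inform", "retriev"))
```

Each result is a plain set of document numbers: a document either matches or it does not, and there is no score to sort by, whereas the vector model attaches a similarity value to every document.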
Question 2 IR evaluation
a. Target and designed queries
Search engines
Ask.com search engine
Google Search engine
Selected Target
Target 3: obtain the manual for installing Tera Term
Queries
Q1 = Tera-term installation manual
Q2 = Guide for tera-term installation
b. List your target, results and designed search queries
Google Search engine
[Figure: Google search engine. Key: green line = precision, white line = recall.]
Ask
[Figure 1: Ask.com search engine. Key: green line = precision, white line = recall.]
Average comparison
[Figure 2: average comparison. Key: green line = precision, white line = recall.]
According to the average comparison of Ask.com and Google over the two queries, Google performs better:
its precision values are higher, and its recall is higher as well. For both queries, Google retrieved more
documents relevant to the search query than Ask.com did.
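For reference, precision and recall for a single query can be computed as in the sketch below. The result identifiers and relevance judgements are hypothetical placeholders, not the Google or Ask.com results from this evaluation, and the function name `precision_recall` is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved result list against a relevant set."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    # Precision: share of retrieved results that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example data, not taken from the evaluation above.
retrieved = ["r1", "r2", "r3", "r4", "r5"]
relevant = {"r1", "r3", "r7", "r9"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these two values over both queries for each engine gives figures of the kind compared in the average comparison plot.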
Bibliography
classeval. (n.d.). Introduction to the precision-recall plot. [online] Available at:
https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/
[Accessed 24 Jan. 2019].
Mikulski, B. (2018). Precision vs. recall - explanation – Bartosz Mikulski. [online] Bartosz
Mikulski. Available at: https://mikulskibartosz.name/precision-vs-recall-explanation-aada1ec393ec
[Accessed 24 Jan. 2019].