
Applying Index Technique to Convert Unstructured Documents into Inverted Index

   

Added on  2023-04-23


Q 1
Below is a list of documents in unstructured format that will be used to apply an index technique to
convert them into an inverted index.
Doc 1: Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources. Searches can be based on full-text or other
content-based indexing.
Doc 2: Information retrieval is finding material of an unstructured nature that satisfies an information
need from within large collections.
Doc 3: Information systems is the study of complementary networks of hardware and software that
people and organizations use to collect, filter, process, create, and distribute data.
The following steps are followed to create an inverted index.
a. Stop word removal and Porter's stemming algorithm
Stop word removal
Removing stop words is the process of eliminating every term classified as a stop word from
all three documents. This process results in the following documents:
Document 1: Information retrieval activity obtaining information resources relevant information
collection information resources Searches based full-text content-based indexing
Document 2: Information retrieval finding material unstructured nature satisfies information
within large collections
Document 3: Information systems study complementary networks hardware software people
organizations collect filter process create distribute data
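The stop word removal step can be sketched in Python as follows. This is only an illustration: the source does not say which stop list was used, so the `STOP_WORDS` set here is an assumption inferred from the words that disappear between the original and the filtered documents (note that it includes "need" and "use"). The function name `remove_stop_words` is also hypothetical.

```python
import string

# Assumed stop list, inferred from the words dropped in the three documents;
# the stop list actually used is not given in the source.
STOP_WORDS = {
    "is", "the", "of", "to", "an", "a", "need", "from", "can", "be",
    "on", "or", "other", "that", "and", "use",
}


def remove_stop_words(text):
    """Return the terms of `text` with punctuation trimmed and stop words dropped."""
    # Trim punctuation from both ends only, so "full-text" keeps its hyphen.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok and tok.lower() not in STOP_WORDS]


doc2 = ("Information retrieval is finding material of an unstructured nature "
        "that satisfies an information need from within large collections.")
print(" ".join(remove_stop_words(doc2)))
```

With this assumed stop list, the call on Doc 2 yields the same term sequence as Document 2 above.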
Porter's stemming algorithm
This algorithm removes suffixes from the terms that make up a document, which is very useful
in information retrieval. In most cases, terms with a common stem have similar meanings.
Consider, for example, the following terms:
Connections
Connected
Connection
Connect
Connecting
In information retrieval, performance improves when terms like the ones listed above are
conflated into a single term. Conflation is achieved by removing the suffixes from the words,
leaving only one term, which is connect for the list above. Stemming reduces the number of
distinct terms in a document, which in turn reduces the size and complexity of the data and
so improves performance. The Porter algorithm was designed on the assumption that no stem
dictionary is available and that the goal of the task is to improve information retrieval
performance.
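The idea of conflating terms by suffix removal can be illustrated with a short Python sketch. This is not the Porter algorithm itself, which applies several ordered rule steps with conditions on the remaining stem; it is a deliberately simplified suffix stripper with a hypothetical suffix list and a hypothetical function name `strip_suffix`, chosen only to cover the five example terms.

```python
# Simplified illustration only: a tiny suffix list, checked longest first.
# The real Porter algorithm uses many more rules and stem-measure conditions.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def strip_suffix(term, min_stem=3):
    """Remove the first matching suffix, keeping at least `min_stem` letters."""
    word = term.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


terms = ["Connections", "Connected", "Connection", "Connect", "Connecting"]
print({strip_suffix(t) for t in terms})  # all five conflate to {'connect'}
```

No dictionary of stems is consulted here, which mirrors the assumption stated above: the rules work on the spelling of each term alone.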
Applying the stemming algorithm to the documents obtained after removing the stop words
results in the following documents:
Document 1: Inform retriev activ obtain inform resourc relev inform collect inform resourc
Search base full text content base index
Document 2: Inform retriev find materi unstructur natur satisfi inform within larg collect
Document 3: Inform system studi complementari network hardwar softwar peopl organ
collect filter process creat distribut data
b. Merged inverted list
To create the merged inverted list, the following steps are followed:
1. Take the final documents obtained after removing stop words and applying Porter's
stemming algorithm, then create a table showing each term and the document that
contains it.
2. Sort the table from step 1 in ascending order by term.
3. Create a merged list showing the within-document frequency of each term, as shown in the
table below.
Microsoft Excel is a useful tool for these steps because it automates most of the actions, for
example sorting the terms in ascending order.
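The same three steps can also be sketched in Python instead of a spreadsheet. The term lists below are the stemmed documents from part a, lowercased, assuming that every occurrence of "information" stems uniformly to inform. The names `STEMMED_DOCS` and `build_inverted_index` are hypothetical, and a nested dictionary is just one possible layout for the merged list.

```python
# Stemmed documents from part a, lowercased (assumption: "information" -> "inform").
STEMMED_DOCS = {
    1: ("inform retriev activ obtain inform resourc relev inform collect "
        "inform resourc search base full text content base index"),
    2: "inform retriev find materi unstructur natur satisfi inform within larg collect",
    3: ("inform system studi complementari network hardwar softwar peopl organ "
        "collect filter process creat distribut data"),
}


def build_inverted_index(docs):
    """Map each term to {document id: within-document frequency}."""
    # Step 1: one (term, document id) row per term occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
    # Step 2: sort the rows in ascending order by term, then by document id.
    pairs.sort()
    # Step 3: merge duplicate rows into within-document frequencies.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, {})
        postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


index = build_inverted_index(STEMMED_DOCS)
for term, postings in index.items():
    print(term, postings)
```

Because the rows are sorted before merging, the dictionary lists the terms in ascending order, and each entry records how often the term occurs in each document that contains it.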
c. Posting file