Deakin SIT772: Assessment 2 - Information Retrieval Techniques

Added on  2023/04/24

Homework Assignment
AI Summary
This assignment solution addresses core concepts in information retrieval, beginning with preprocessing steps such as stop word removal and stemming using Porter's algorithm. It then guides the creation of a merged inverted list and dictionary, demonstrating the indexing process. The solution includes examples of Boolean query design (AND, OR, NOT) and their application to relevant documents. Furthermore, the assignment explores search engine evaluation, comparing Google and Yahoo, and analyzing precision and recall metrics through designed search queries. The solution includes the creation of charts to visualize the results. The references include several research papers on Information Retrieval.
Document Page
Name
Institution
Assessment 2: Information Retrieval Techniques Problem Solving Task
Date
Question one
a. Remove all stop words and punctuation, and then apply Porter’s stemming
algorithm to the documents. The list of stop words for this task is provided as
follows: Is, The, Of, To, An, A, From, Can, Be, On, Or, That, Within, And, Use
Answer
In everyday language, words such as Is, The, Of, To, An, A, From, Can, Be, On, Or,
That, Within, And, Use are referred to as stop words. These words are usually
removed from the text because they carry little meaning on their own. This can be
achieved using NLTK (Natural Language Toolkit) (Büttcher et al., 2016).
Stop word removal should be done before stemming. If stemming is done first, the
stemmer can alter a stop word so that it no longer matches the stop word list and
is therefore not filtered. For example, the original Porter algorithm reduces the
stop word "use" to "us", so "us" would remain in the term vector after filtering
against a list that contains only "use". Alternatively, one can run the same
stemmer on the stop word list and then filter the stemmed text with the stemmed
stop words.
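The preprocessing order described above can be sketched as follows. This is a minimal illustration, not the assignment's own code: the function name `preprocess` and the sample sentence are assumptions, and the `stem` argument defaults to a no-op so the sketch runs with the standard library only. A Porter stemmer, for example the `stem` method of NLTK's `PorterStemmer`, could be passed in its place.

```python
import re

# Stop word list given in the task statement, lowercased for matching.
STOP_WORDS = {
    "is", "the", "of", "to", "an", "a", "from", "can", "be",
    "on", "or", "that", "within", "and", "use",
}


def preprocess(text, stem=lambda token: token):
    """Lowercase, strip punctuation, drop stop words, then stem what is left."""
    # Keeping only alphanumeric runs removes punctuation in the same pass.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop words are filtered BEFORE stemming, as argued in the answer.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in kept]


# Hypothetical sample sentence, not taken from the assignment documents.
sample = "The index is built from the documents, and it can be searched."
print(preprocess(sample))
```

Filtering first means the stop word list can stay in its original, unstemmed form.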
b. Create a merged inverted list including the within-document frequencies for
each term
Answer
Collect the documents to be indexed, for example:
Senate, President, Citizen
So let it to be build
Tokenize the text, turning each document into a list of tokens.
Apply linguistic preprocessing to produce a list of normalized tokens, which are
the indexing terms.
...
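A merged inverted list with within-document frequencies can be sketched as below. This is a hedged illustration under stated assumptions: the function name `build_inverted_list` and the two sample documents are hypothetical, chosen so that one term repeats inside a document, and stop word removal and stemming are left out to keep the example short.

```python
import re
from collections import Counter, defaultdict


def build_inverted_list(docs):
    """Map each term to a list of (doc_id, within-document frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = re.findall(r"[a-z0-9]+", docs[doc_id].lower())
        # Counter gives the number of times each term occurs in this document.
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    # Sort terms alphabetically so the merged list reads like a dictionary.
    return dict(sorted(index.items()))


# Hypothetical documents used only for illustration.
sample_docs = {
    1: "search engine index",
    2: "index compression and index search",
}
inverted = build_inverted_list(sample_docs)
for term, postings in inverted.items():
    print(term, postings)
```

Because documents are visited in increasing ID order, each posting list is already sorted by document ID.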
c. Use the index created in part (b) to create a dictionary and the related posting
file.
Answer
Create an inverted index, consisting of a dictionary and postings, that records
the documents in which each term occurs.
Doc 1: Senate, President, Citizen
Doc 2: So let it to build

Term-DocID pairs      Sorted by term        Dictionary and postings
Term       DocID      Term       DocID      Term       Doc. freq.  Posting list
Senate     1          Build      2          Build      1           2
President  1          Citizen    1          Citizen    1           1
Citizen    1          It         2          It         1           2
So         2          Let        2          Let        1           2
To         2          President  1          President  1           1
It         2          Senate     1          Senate     1           1
Build      2          So         2          So         1           2
Let        2          To         2          To         1           2
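The three stages of the table (term-DocID pairs, sorted pairs, dictionary with postings) can be reproduced with a short sort-based sketch. The variable names are assumptions made for illustration, and the term "to" is kept because it appears in the table.

```python
import re

docs = {1: "Senate, President, Citizen", 2: "So let it to build"}

# Stage 1: emit one (term, doc_id) pair per token.
pairs = [
    (term, doc_id)
    for doc_id, text in docs.items()
    for term in re.findall(r"[a-z]+", text.lower())
]

# Stage 2: sort the pairs by term, then by document ID.
pairs.sort()

# Stage 3: merge pairs for the same term into one posting list,
# skipping repeated document IDs.
postings = {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)

# The dictionary stores each term with its document frequency,
# which is the length of its posting list.
dictionary = {term: len(plist) for term, plist in postings.items()}

for term in dictionary:
    print(term, dictionary[term], postings[term])
```

Every term here occurs in exactly one document, so each document frequency is 1.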
d. Test the inverted index using the following keywords: information, system,
index
Answer
PISI, an inverted index based on the time concept, offers appreciable properties
that make it well known and widely used in the field of information retrieval. In
addition, it can be incorporated into an existing document-level search algorithm
to extend its search ability to the level of single words (Ruotsalo et al., 2015).
First, the space used by PISI is broken down into the individual components of the
index data structure. Various instances of PISI are listed in one column according
to the configuration of the parameters α and β. If both parameters have high
values, compression is better but decompression is slower.
Space breakdown when run on the bible.txt file
Compression ratio / snippet time trade-off for various values of the parameters
α and β. Time is given in microseconds per character. Each curve represents
snippets with a given number of decompressed words.
Snippet time / compression ratio trade-off. Time is recorded in microseconds per
character.
e. Please design three Boolean queries, (for example, web AND search) and list
the relevant documents for each query.
Answer
A Boolean query is a type of search that combines the operators AND, OR and NOT
to produce more pertinent results.
AND: Boolean Query Operator
The AND operator narrows the search: all the search terms must be present in
each retrieved document.
Query
Apple AND phone AND Battery: in this statement a space implies AND, so one does
not need to type the word AND. Hence, the statement can be written as
Apple Phone Battery
OR: Boolean Query Operator
This operator broadens the search results by linking several phrases or keywords.
The interpretation of this operator is "at least one is required; more than one,
or all, may be present". One needs to use parentheses around OR statements as a
matter of appropriate search syntax (Bouadjenek et al., 2016).
Apple OR Phone OR Battery
The search results must contain at least one of the keywords.
NOT: Boolean Query Operator
This operator is used to exclude terms from the search results.
“Community college” NOT “Higher education”
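On an inverted index, the three operators correspond to set operations on posting lists: AND is intersection, OR is union, and NOT is set difference. The sketch below illustrates this under stated assumptions: the four documents and the helper names `build_postings` and `lookup` are hypothetical, and queries are single terms rather than quoted phrases.

```python
import re

# Hypothetical sample collection used only for illustration.
docs = {
    1: "Apple phone battery replacement",
    2: "Apple pie recipe",
    3: "Phone battery life tips",
    4: "Community college courses",
}


def build_postings(collection):
    """Map each term to the set of document IDs that contain it."""
    postings = {}
    for doc_id, text in collection.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            postings.setdefault(term, set()).add(doc_id)
    return postings


postings = build_postings(docs)


def lookup(term):
    """Return the posting set for a term, or an empty set if it is unknown."""
    return postings.get(term.lower(), set())


# Apple AND phone AND battery: intersection of the posting sets.
and_result = lookup("apple") & lookup("phone") & lookup("battery")
# Apple OR phone OR battery: union of the posting sets.
or_result = lookup("apple") | lookup("phone") | lookup("battery")
# battery NOT apple: documents with "battery" minus those with "apple".
not_result = lookup("battery") - lookup("apple")

print(sorted(and_result), sorted(or_result), sorted(not_result))
```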
Question two
The two search engines that will be used in this section are Google and
Yahoo
a. List your target, results and designed search queries (You can use any
keywords you think are related to the target).
Answer
i. For query 1, the precision and recall values for the top 20 documents,
interpolated to the 11 standard recall levels
Hence,
Query 1 Google
Rank 1 2 3 4 5 6 7 8 9 1
Recall 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0
precision 0.5 0.3 0.25 0.1 0.01
ii. For query 2, the precision and recall values for the top 20 documents,
interpolated to the 11 standard recall levels
Query 2 Yahoo
Rank 1 2 3 4 5 6 7 8 9 1
Recall 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0
Precision 0.4 0.3 0.1 0.09
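Precision and recall at each rank, and their interpolation to the 11 standard recall levels (0.0, 0.1, ..., 1.0), can be computed as sketched below. The function names and the relevance judgments are hypothetical and are not the judgments behind the tables above. The interpolated precision at a recall level is taken as the highest precision observed at any recall greater than or equal to that level.

```python
def precision_recall_at_ranks(relevance, total_relevant):
    """Return a (recall, precision) pair for every rank in the result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / rank))
    return points


def interpolate_11pt(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return [
        # Highest precision at any recall >= level; 0.0 if none is reached.
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Hypothetical judgments: 1 = relevant, 0 = not relevant, in rank order.
relevance = [1, 0, 1, 0, 0]
points = precision_recall_at_ranks(relevance, total_relevant=2)
interpolated = interpolate_11pt(points)
print([round(p, 2) for p in interpolated])
```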
iii. Average of query 1 and query 2
Query 1 Google
Rank 1 2 3 4 5 6 7 8
Recall 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
precision 0.5 0.3 0.25 0.1 0.01
Query 2 Yahoo
Recall 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
Precision 0.4 0.3 0.1 0.09
Average of q1 and q2 0.45 0.3 0.175 0.1 0.01 0.09
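Averaging two precision rows position by position fails in a spreadsheet when both cells are empty, because the average divides by zero. A hedged sketch of the same calculation that skips missing values is given below. The column alignment of the Yahoo row is an assumption (the value 0.09 is placed in the sixth position), chosen because it is consistent with the average row above; the function name `average_rows` is hypothetical.

```python
def average_rows(row_a, row_b):
    """Average two rows position by position, ignoring missing (None) cells."""
    averages = []
    for a, b in zip(row_a, row_b):
        values = [v for v in (a, b) if v is not None]
        # Leave the cell empty instead of dividing by zero.
        averages.append(sum(values) / len(values) if values else None)
    return averages


# Precision values from the tables; None marks an empty cell.
# The positions of the empty cells are an assumption.
google = [0.5, 0.3, 0.25, 0.1, 0.01, None, None, None]
yahoo = [0.4, 0.3, 0.1, None, None, 0.09, None, None]

averages = average_rows(google, yahoo)
print([None if v is None else round(v, 3) for v in averages])
```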
The chart
Chart Title: Series1 to Series4, x-axis 1 to 6, y-axis 0 to 0.6
b. List your target, results and designed search queries
Answer
Query 1 Google
Rank 1 2 3 4 5 6 7 8
Recall 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
precision 0.5 0.3 0.25 0.1 0.01
precision 0.4 0.3 0.1 0.09
Average of q1 and q2 0.45 0.3 0.175 0.1 0.01 0.09
Chart Two: Series1 to Series4, x-axis 1 to 6, y-axis 0 to 0.6
References
Bouadjenek, M.R., Hacid, H. and Bouzeghoub, M., 2016. Social networks and
information retrieval, how are they converging? A survey, a taxonomy and an
analysis of social information retrieval approaches and platforms. Information
Systems, 56, pp.1-18.
Büttcher, S., Clarke, C.L. and Cormack, G.V., 2016. Information retrieval:
Implementing and evaluating search engines. MIT Press.
Ruotsalo, T., Jacucci, G., Myllymäki, P. and Kaski, S., 2015. Interactive intent
modeling: Information discovery beyond search. Communications of the
ACM, 58(1), pp.86-92.
Ye, X., Shen, H., Ma, X., Bunescu, R. and Liu, C., 2016, May. From word
embeddings to document similarities for improved information retrieval in
software engineering. In Proceedings of the 38th International Conference on
Software Engineering (pp. 404-415). ACM.