Information Retrieval Engine: Building a Search Engine with Java

ASSIGNMENT STAGE 1
Introduction
Due to the wide range of advancements in technology, the internet is now flooded with vast amounts of
information. Every company and individual has information stored in the cloud and on other online sites,
and retrieving that information can be difficult. In this assignment, I will develop an information
retrieval engine that can index a collection of documents and retrieve matching documents in response
to a keyword query. The program will retrieve information using the vector space model and will be
written in Java.
Implementation plan of the functionalities and their test cases
During the development of the program, various functionalities will be included so that it executes
efficiently and retrieves the information. The plan involves creating the modules independently based on
their specific functions. The source files to be written are described below.
Programming source files and their functionalities.
1. MySearchEngine.java
The program is called MySearchEngine. This file contains the main method that initializes and executes
the rest of the source files; in it, the Searcher, inverted index, tokenizer, Indexer and Stemmer
components are declared. To compile the program, the user executes the javac *.java command within
the source code directory.
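
As a rough illustration, the entry point could be organized as in the sketch below. The command names, argument order, and the Indexer and Searcher method signatures used here are assumptions made for this sketch rather than part of the final specification.

// MySearchEngine.java -- a minimal sketch of the entry point (assumed CLI layout).
public class MySearchEngine {

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java MySearchEngine index <collection_dir> <index_file>");
            System.out.println("       java MySearchEngine search <index_file> <query terms...>");
            return;
        }

        if (args[0].equals("index")) {
            // Scan the collection and write the inverted index to disk.
            Indexer indexer = new Indexer(args[1]);
            indexer.writeIndex(args[2]);
        } else if (args[0].equals("search")) {
            // Load the index and rank documents against the keyword query.
            Searcher searcher = new Searcher(args[1]);
            String[] queryTerms = java.util.Arrays.copyOfRange(args, 2, args.length);
            for (java.util.Map.Entry<String, Double> hit : searcher.rank(queryTerms, 10)) {
                System.out.println(hit.getKey() + " " + hit.getValue());
            }
        }
    }
}

After compilation with javac *.java, the engine would then be invoked as, for example, java MySearchEngine index collection_dir index.txt.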
2. Searcher.java
In this source file, I will design the program to tokenize the raw query as instructed and then calculate
the cosine similarity for every document that contains at least one query term. The dot product is built
up query term by query term and then divided by the vector norms. Statements to acquire all the
documents and their term frequencies for each query term are included. The next step is to iterate
through each of those documents, form the tf-IDF product, and either add it to the previous value for
that document or initialize its entry in the dot-products HashMap. Within the same if statement, a
function adds the term's contribution to the query vector norm. Functionalities that build up the cosine
similarity scores for each document are then written; they use a priority
queue to automatically store documents in descending order of score. Finally, the source file should print
out all the documents in order of cosine similarity.
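
A minimal sketch of this ranking logic is given below, assuming the postings, IDF values and document vector norms have already been loaded from the inverted index file; all field and method names are illustrative rather than fixed.

import java.util.*;

// Searcher.java -- a sketch of the cosine-similarity ranking described above.
public class Searcher {

    // term -> (document name -> raw term frequency in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    // term -> inverse document frequency
    private final Map<String, Double> idf = new HashMap<>();
    // document name -> precomputed tf-IDF vector norm
    private final Map<String, Double> docNorms = new HashMap<>();

    public Searcher(String indexFile) {
        // Parsing the inverted index file into the maps above is omitted in this sketch.
    }

    // Returns the topK documents ranked by cosine similarity to the query.
    public List<Map.Entry<String, Double>> rank(String[] queryTerms, int topK) {
        // Count query term frequencies so repeated query terms weigh more.
        Map<String, Integer> queryTf = new HashMap<>();
        for (String t : queryTerms) {
            queryTf.merge(t.toLowerCase(), 1, Integer::sum);
        }

        Map<String, Double> dotProducts = new HashMap<>();
        double queryNormSquared = 0.0;

        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            Double termIdf = idf.get(q.getKey());
            if (termIdf == null) continue;              // term not in the corpus
            double queryWeight = q.getValue() * termIdf;
            queryNormSquared += queryWeight * queryWeight;

            // Add this term's tf-IDF contribution to every document containing it.
            for (Map.Entry<String, Integer> d : postings.get(q.getKey()).entrySet()) {
                double docWeight = d.getValue() * termIdf;
                dotProducts.merge(d.getKey(), queryWeight * docWeight, Double::sum);
            }
        }

        // Divide each dot product by the two vector norms to get the cosine,
        // and let a max-ordered priority queue produce the ranking.
        double queryNorm = Math.sqrt(queryNormSquared);
        PriorityQueue<Map.Entry<String, Double>> ranking =
                new PriorityQueue<>((a, b) -> Double.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Double> e : dotProducts.entrySet()) {
            double cosine = e.getValue() / (docNorms.get(e.getKey()) * queryNorm);
            ranking.add(new AbstractMap.SimpleEntry<>(e.getKey(), cosine));
        }

        List<Map.Entry<String, Double>> results = new ArrayList<>();
        for (int i = 0; i < topK && !ranking.isEmpty(); i++) {
            results.add(ranking.poll());
        }
        return results;
    }
}

Because the priority queue is ordered by descending score, polling it repeatedly yields the documents from most to least similar to the query.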
3. Indexer.java
This file indexes all the documents stored in the collection directory collection_dir. The process involves
a constructor that first scans all the documents. A line counter is incremented for debugging
(lineNumber++;), and the statement String line = scan.nextLine(); reads one line of text into a string,
which is trimmed and stored via fileLines.add(line.trim()); before being split on spaces into a string array.
The frequencies of all corpus terms are then recorded, and for any given term in the corpus, the number
of documents it appears in is stored. The index entries take the form "term: fileName1, termFreq1,
fileName2, termFreq2" and so on.
The document filename is acquired and the code iterates through all the tokens in the document, adding
them to a HashMap of term document frequencies. A functionality that adds to the term-frequencies
HashMap then finishes building the collections from the corpus. After iterating through all corpus tokens,
the inverted index is written out to a file with the IDF values appended at the end of each line. The
source file finally calculates the IDF, rounds it, and builds the final string to write to the file. Each line of
the stop-word list should contain one stop word.
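
A simplified sketch of this indexing pass is shown below. Tokenization is reduced to whitespace splitting and lower-casing, the output line format follows the description above, and the IDF formula log10(N / document frequency) is one common choice assumed for this sketch.

import java.io.*;
import java.util.*;

// Indexer.java -- a sketch of the indexing pass described above.
public class Indexer {

    // term -> (document name -> term frequency)
    private final Map<String, Map<String, Integer>> index = new TreeMap<>();
    private int totalDocuments = 0;

    public Indexer(String collectionDir) throws IOException {
        File[] files = new File(collectionDir).listFiles();
        if (files == null) {
            throw new IOException("Cannot read collection directory: " + collectionDir);
        }
        for (File doc : files) {
            totalDocuments++;
            try (Scanner scan = new Scanner(doc)) {
                while (scan.hasNextLine()) {
                    // Read one line, trim it, and split it on spaces into tokens.
                    String line = scan.nextLine().trim();
                    for (String token : line.split("\\s+")) {
                        if (token.isEmpty()) continue;
                        index.computeIfAbsent(token.toLowerCase(), t -> new HashMap<>())
                             .merge(doc.getName(), 1, Integer::sum);
                    }
                }
            }
        }
    }

    // Writes one line per term: its postings followed by the rounded IDF.
    public void writeIndex(String outFile) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(outFile))) {
            for (Map.Entry<String, Map<String, Integer>> e : index.entrySet()) {
                // Assumed IDF = log10(N / document frequency), rounded to three decimal places.
                double idf = Math.log10((double) totalDocuments / e.getValue().size());
                StringBuilder lineOut = new StringBuilder(e.getKey()).append(":");
                for (Map.Entry<String, Integer> posting : e.getValue().entrySet()) {
                    lineOut.append(" ").append(posting.getKey())
                           .append(", ").append(posting.getValue()).append(",");
                }
                lineOut.append(" ").append(Math.round(idf * 1000.0) / 1000.0);
                out.println(lineOut);
            }
        }
    }
}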
4. Stemmer.java
This class will be used to transform a word into its root form. The input word can be provided one
character at a time by calling add(), or all at once by calling one of the various stem(...) methods
(Schymik, 2012). After a word has been stemmed, the result can be retrieved, or a reference to the
internal buffer can be obtained. This is to be implemented using the Porter stemming algorithm
(Stemmer, 1980). In each indexed file, the fields will be separated by commas, the lines will be separated
by the end-of-line character, and all non-integer quantities will be rounded to three decimal places.
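
The snippet below sketches how the stemmer could be driven, assuming the add(char), stem() and toString() methods of the Porter stemmer reference implementation cited in the references; the class name StemmerDemo is only for illustration.

// StemmerDemo.java -- usage sketch for the Porter stemmer interface described above.
public class StemmerDemo {

    public static void main(String[] args) {
        Stemmer stemmer = new Stemmer();

        // Feed the word one character at a time via add() ...
        for (char ch : "retrieval".toCharArray()) {
            stemmer.add(ch);
        }

        // ... then stem it and read the root form from the internal buffer.
        stemmer.stem();
        System.out.println(stemmer.toString());   // e.g. "retriev"
    }
}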
Conclusion
The program should compile and execute all the source files effectively and return the expected results.
All the source files will be saved with a .java extension, and their respective class files will be created
after successful compilation.
References
Schymik, G. (2012). The Impact of Subject Indexes on Semantic Indeterminacy in Enterprise Document
Retrieval. Arizona: Arizona State University.
Stemmer, P. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137. Retrieved from
http://www.tartarus.org/~martin/PorterStemmer