EEC 220CT - Data & Information Retrieval: Database and Analysis

Added on 2023/05/31
Running head: DATA AND INFORMATION RETRIEVAL
Data and Information Retrieval
Name of the Student
Name of University
Author’s note:
Table of Contents
Part 2:
2.1 Database Solution
2.2 Justification of Database
2.3 Storing and Accessing Data
2.4 Benefits and Drawbacks
2.5 Quality of Service
Part 3:
3.1 SQL Statement
3.2 MapReduce Function
3.3 Monitoring Data
Bibliography
Part 2:
2.1 Database Solution:
The database solution for the selected data set is MongoDB, which will be installed on a Windows or Linux system (Gyorodi et al. 2015).
2.2 Justification of Database:
The main reason for choosing the MongoDB database is its scalability, which comes from its use of JSON-like documents for storing data. MongoDB will allow the NASA EXOPLANET ARCHIVE database to be more robust and efficient (Kang et al. 2016). The data set described in the database is very similar to the kinds of data sets for which MongoDB is well suited.
2.3 Storing and Accessing Data:
The database can be downloaded from the server as a CSV file. Various other file types can also be downloaded, but CSV has been chosen because of its suitability for import into MongoDB (Hashem et al. 2015). The following screenshot shows the data imported into MongoDB.
The MongoDB shell can be used to run queries for collecting the data. To retrieve all documents from the Planet collection, run ‘db.getCollection('Planet').find({})’.
Indexes can be created to find data more quickly.
db.planet.createIndex( { loc_rowid: -1 } ) creates a sample descending index on loc_rowid.
2.4 Benefits and Drawbacks:
Using MongoDB makes the selected database very flexible, and a much larger volume of data can be stored in it. The planet data can be accessed more efficiently than with other big-data databases. MongoDB is also very easy to install and configure, which makes the database management system easy to use: developers can concentrate on the solution rather than worrying about the application itself. The code used in the Planets database will define the schemas (Gyorodi et al. 2015), which is possible because of the schema-less nature of MongoDB. MongoDB also allows developers to use dynamic queries through its support for a document query language. Because MongoDB does not rely on complex joins to search data, data access is very efficient (Hashem et al. 2015). MongoDB provides a good amount of documentation, which allows the application owner to track the status and progress of technical operations.
The main disadvantage of the database is that there is no join support as in an RDBMS. MongoDB uses a lot of memory, which makes it a costly approach, and it does not have transaction support. Its built-in Map/Reduce function is not efficient; to make Map/Reduce efficient, technologies like Hadoop may need to be considered. MongoDB updates are infrequent, and it often lags behind advanced big-data technologies (Kang et al. 2016). Because the Planet database has many columns and a large amount of data, queries that find data under multiple conditions can become very complex. Locking of the database is also a common phenomenon in the MongoDB solution; various operations may require locking the Planet database.
2.5 Quality of Service:
MongoDB is the best solution when it comes to scalability and availability. Because the selected database is very large and complex, the scalability of MongoDB makes it a better choice. MongoDB is also an open-source big-data management system (Gyorodi et al. 2015). Open-source software grants users the right to study, change, and distribute the application to anyone and for any purpose; the source code is released under a licence that lets users run the application without even buying it. MongoDB therefore allows users to modify it to best suit their business requirements.
Part 3:
3.1 SQL Statement:
SQL Statement to Create Table:
Create Table Flight (
    flightMonth int,
    DayOfMonth int,
    DayOfWeek int,
    DepartureTime Time,
    ActualDepartureTime Time,
    ArrivalTime Time,
    Carrier Varchar(40),
    FlightNumber Varchar(40),
    DepartureDelay Time NULL,
    ArrivalDelay Time,
    Cancellation Varchar(4),
    WeatherDelay Time,
    -- a carrier operates many flights, so Carrier alone cannot be unique;
    -- a composite key identifies one flight on one day
    Primary Key (Carrier, FlightNumber, flightMonth, DayOfMonth)
);
SQL Statement to Determine Number of Flights:
Select Count(DepartureDelay) From Flight Where DepartureDelay Is Not Null;
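The counting query can be sketched with Python's built-in sqlite3 module; the simplified table and rows below are hypothetical sample data, not the assignment's real data set:

```python
import sqlite3

# In-memory database with a simplified Flight table (sample data only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Flight (
        Carrier TEXT,
        FlightNumber TEXT,
        DepartureDelay INTEGER  -- minutes; NULL when no delay was recorded
    )
""")
conn.executemany(
    "INSERT INTO Flight VALUES (?, ?, ?)",
    [("AA", "100", 15), ("AA", "101", None), ("UA", "200", 5)],
)

# Count only rows with a recorded departure delay.
# Note: '!= NULL' never matches in SQL; 'IS NOT NULL' is required.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM Flight WHERE DepartureDelay IS NOT NULL"
).fetchone()
print(count)  # → 2
```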
Table data:
Table structure:
3.2 MapReduce Function:
The solution for processing a large chunk of data in a centralized way is to simulate MapReduce in R. The data will be produced in CSV format and loaded into RStudio. First, indexes will be created to make searching faster and more accurate. The following code creates the indexes.
create index year on ontime(year);
create index date on ontime(year, month, dayofmonth);
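Since the ontime data is stored in SQLite (via RSQLite), the effect of these indexes can be checked with SQLite's EXPLAIN QUERY PLAN. This Python sketch uses a hypothetical in-memory ontime table with the same two indexes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (year INT, month INT, dayofmonth INT, depdelay INT)")
conn.execute("INSERT INTO ontime VALUES (2008, 1, 28, 15)")

# The same indexes as above.
conn.execute("create index year on ontime(year)")
conn.execute("create index date on ontime(year, month, dayofmonth)")

# Ask SQLite how it would execute a date lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ontime "
    "WHERE year = 2008 AND month = 1 AND dayofmonth = 28"
).fetchall()
# The plan should report a search using the composite 'date' index
# rather than a full table scan.
print(plan)
```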
The best option for handling the big data in this solution is RSQLite, which is available on both Windows and Linux. The following code installs and loads the package, connects to the database, and runs some queries.
install.packages("RSQLite")
library(RSQLite)
ontime <- dbConnect(SQLite(), dbname = "ontime.sqlite3")
from_db <- function(sql) {
  dbGetQuery(ontime, sql)
}
from_db("select count(*), tailnum from ontime group by tailnum")
tails <- from_db("select distinct tailnum from ontime")
There are two ways to load the database. The first is to insert all years' data into one database file; the other is to keep a separate database file for each year. The second approach is more reliable, as data management can be done more efficiently. The per-year CSV files can then be compressed with gzip, which RSQLite and data.table can read faster than other compressed formats. The following code can be used to load the data.
# load list of all compressed CSV files in the flights folder
flights.files <- list.files(path = flights.folder.path, pattern = "*.csv.gz")
# read one file into a data.table (i indexes the file being loaded)
flights <- data.table(read.csv(flights.files[i], stringsAsFactors = FALSE))
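The gzip-compressed CSV handling can be illustrated with Python's standard library; the file name and columns below are hypothetical sample data:

```python
import csv
import gzip
import os
import tempfile

# Write a small hypothetical flights CSV, gzip-compressed.
path = os.path.join(tempfile.mkdtemp(), "flights_2008.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "uniquecarrier", "depdelay"])
    writer.writerow([2008, "AA", 15])
    writer.writerow([2008, "UA", 0])

# Read it back without decompressing to disk first.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows))                  # → 2
print(rows[0]["uniquecarrier"])   # → AA
```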
The next step is mapping the analysis. For implementing the database, a sample set of flight data has been selected, and the mapping has been done based on that sample data. The code is as follows.
getFlightsStatusByAirlines <- function(flights, yr){
  # by year
  if(verbose) cat("Getting stats for airlines:", "\n")
  airlines.stats <- flights[, list(
    dep_airports = length(unique(origin)),
    flights = length(origin),
    flights_cancelled = sum(cancelled, na.rm = TRUE),
    flights_diverted = sum(diverted, na.rm = TRUE),
    flights_departed_late = length(which(depdelay > 0)),
    flights_arrived_late = length(which(arrdelay > 0)),
    total_dep_delay_in_mins = sum(depdelay[which(depdelay > 0)]),
    avg_dep_delay_in_mins = round(mean(depdelay[which(depdelay > 0)])),
    median_dep_delay_in_mins = round(median(depdelay[which(depdelay > 0)])),
    miles_traveled = sum(distance, na.rm = TRUE)
  ), by = uniquecarrier][, year := yr]
  # change column order
  setcolorder(airlines.stats, c("year", colnames(airlines.stats)[-ncol(airlines.stats)]))
  # save this data
  saveData(airlines.stats, paste(flights.folder.path, "stats/5/airlines_stats_", yr, ".csv", sep = ""))
  # clear up space
  rm(airlines.stats)
  # continue.. see git full code
}
The map function is coded below.
#map all calculations
mapFlightStats <- function(){
  for(j in 1:period) {
    yr <- as.integer(gsub("[^0-9]", "", gsub("(.*)(\\.csv)", "\\1", flights.files[j])))
    flights.data.file <- paste(flights.folder.path, flights.files[j], sep = "")
    if(verbose) cat(yr, ": Reading : ", flights.data.file, "\n")
    flights <- data.table(read.csv(flights.data.file, stringsAsFactors = FALSE))
    setkeyv(flights, c("year", "uniquecarrier", "dest", "origin", "month"))
    # call functions
    getFlightStatsForYear(flights, yr)
    getFlightsStatusByAirlines(flights, yr)
    getFlightsStatsByAirport(flights, yr)
  }
}
The reduce function is coded below.
#reduce all results
reduceFlightStats <- function(){
  n <- 1:6
  folder.path <- paste("./raw-data/flights/stats/", n, "/", sep = "")
  print(folder.path)
  for(i in n){
    filenames <- paste(folder.path[i], list.files(path = folder.path[i], pattern = "*.csv"), sep = "")
    dt <- do.call("rbind", lapply(filenames, read.csv, stringsAsFactors = FALSE))
    print(nrow(dt))
    saveData(dt, paste("./raw-data/flights/stats/", i, ".csv", sep = ""))
  }
}
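The overall map and reduce pattern used above can also be sketched in plain Python. The carrier names and delay values below are hypothetical sample data, and the two per-carrier statistics stand in for the fuller set computed in R:

```python
from collections import defaultdict

# Hypothetical sample records: (carrier, departure delay in minutes).
records = [("AA", 15), ("AA", 0), ("UA", 5), ("UA", 30), ("AA", 10)]

def map_stage(records):
    """Map: emit (carrier, delay) pairs for flights that departed late."""
    for carrier, depdelay in records:
        if depdelay > 0:
            yield carrier, depdelay

def reduce_stage(pairs):
    """Reduce: aggregate per-carrier counts and total delay minutes."""
    stats = defaultdict(lambda: {"flights_departed_late": 0,
                                 "total_dep_delay_in_mins": 0})
    for carrier, delay in pairs:
        stats[carrier]["flights_departed_late"] += 1
        stats[carrier]["total_dep_delay_in_mins"] += delay
    return dict(stats)

result = reduce_stage(map_stage(records))
print(result["AA"])  # → {'flights_departed_late': 2, 'total_dep_delay_in_mins': 25}
```

Because each (carrier, delay) pair is independent, the map stage can be run on many files in parallel and the reduce stage merged afterwards, which is what gives MapReduce its scalability.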
Figure 1: The Map-Reduce Model of Flight
(Source: Created by Author)
The values B141, 1, 28 correspond to the carrier, month, and day attributes. The value ‘1’ means that the flight took off in January, and the value ‘28’ means that it took off on 28 January. Taken together, the record means that a flight by carrier B141 was scheduled on 28 January.
Document Page
12DATA AND INFORMATION RETRIEVAL
Figure 2: Map-Reduce Function in RSQLite
(Source: Stander and Dalla Valle 2017, p. 65)
3.3 Monitoring Data:
The proposed solution is capable of finding any data in the database. The RStudio server can optionally be configured to audit all R console activity by writing console input and output to a central location. Each piece of data stored in the big-data database will be accessible through the RStudio console. The indexes created at the start remain the key to finding any specific data in the data set very quickly, and because of them query latency is predictable. Indexing allows the database to run in near real-time, meaning the user can query recently entered data almost as soon as it arrives. The proposed RStudio database has only local indexing; indexes are critical for a big-data database once the data set reaches millions of records or more. The MapReduce function makes the database more effective and efficient by allowing the application to divide tasks for parallel execution. The main advantage of MapReduce for monitoring the data is the scalability of the database management system: MapReduce can process petabytes of data and return a response to the user in little time. The delayed-flight data is included in the MapReduce function, and the solution is very useful for searching this kind of data in the database.
Bibliography:
Bello-Orgaz, G., Jung, J.J. and Camacho, D., 2016. Social big data: Recent achievements and
new challenges. Information Fusion, 28, pp.45-59.
Chen, M., Mao, S. and Liu, Y., 2014. Big data: A survey. Mobile networks and applications,
19(2), pp.171-209.
Cherrie, M.P., Nichols, G., Iacono, G.L., Sarran, C., Hajat, S. and Fleming, L.E., 2018.
Pathogen seasonality and links with weather in England and Wales: A big data time series
analysis. BMC public health, 18(1), p.1067.
Erevelles, S., Fukawa, N. and Swayne, L., 2016. Big Data consumer analytics and the
transformation of marketing. Journal of Business Research, 69(2), pp.897-904.
Frey, M. and Larch, M., 2017. Statistical Learning in the Age of “Big Data” and Machine
Learning.
Gyorodi, C., Gyorodi, R., Pecherle, G. and Olah, A., 2015. A comparative study: MongoDB
vs. MySQL. In Engineering of Modern Electric Systems (EMES), 2015 13th International
Conference on (pp. 1-6). IEEE.
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A. and Khan, S.U., 2015. The
rise of “big data” on cloud computing: Review and open research issues. Information
systems, 47, pp.98-115.
Huh, J.H., Kim, H.B. and Seo, K., 2016. A preliminary analysis model of big data for
prevention of bioaccumulation of heavy metal-based pollutants: Focusing on the atmospheric
data analyses. Adv. Sci. Technol. Lett. SERSC, 129, pp.159-164.
Kang, Y.S., Park, I.H., Rhee, J. and Lee, Y.H., 2016. MongoDB-based repository design for
IoT-generated RFID/sensor big data. IEEE Sensors Journal, 16(2), pp.485-497.
Lopez, V., del Rio, S., Benitez, J.M. and Herrera, F., 2015. Cost-sensitive linguistic fuzzy
rule based classification systems under the MapReduce framework for imbalanced big data.
Fuzzy Sets and Systems, 258, pp.5-38.
Oswald, F.L. and Putka, D.J., 2017. Big data methods in the social sciences. Current opinion
in behavioral sciences, 18, pp.103-106.
Raghupathi, W. and Raghupathi, V., 2014. Big data analytics in healthcare: promise and
potential. Health information science and systems, 2(1), p.3.
Stander, J. and Dalla Valle, L., 2017. On Enthusing Students About Big Data and Social
Media Visualization and Analysis Using R, RStudio, and RMarkdown. Journal of Statistics
Education, 25(2), pp.60-67.