ISYS114 Computing Assignment: Hadoop - Functionality and Applications

Added on 2023/06/03

AI Summary

This report provides a detailed overview of Hadoop, focusing on its functionality and applications in the context of big data processing. It explains how Hadoop addresses the challenges of volume, velocity, and variety in data management, highlighting its origins as a counterweight to Google's BigTable. The report delves into Hadoop's two main systems: the Hadoop Distributed File System (HDFS) and the MapReduce engine, detailing how HDFS distributes storage across multiple machines for cost-effectiveness and reliability, while MapReduce filters, sorts, and processes data. It further elaborates on Hadoop's distributed processing approach, emphasizing the roles of Namenode and Datanode in HDFS. The report also outlines the advantages of Hadoop, including its scalability, cost-effectiveness, and speed in data processing.

COMPUTING
1
What is Hadoop? How does it work?
In the big data world, the volume, velocity and variety of data render traditional technologies
ineffective. To overcome this limitation, organizations such as Google and Yahoo looked for
ways to manage all the information gathered by their servers in a proper and cost-effective way.
Hadoop was created by the Yahoo engineer Doug Cutting as a counterweight to Google's BigTable.
Yahoo adopted Hadoop so that big data problems could be broken down into smaller pieces that
can be processed in parallel. Hadoop is an open source project available under the Apache
License 2.0, and organizations now use it to manage large chunks of data. Hadoop has two main
systems. The first is the Hadoop Distributed File System (HDFS), in which storage is spread out
over multiple machines, which reduces cost and improves overall reliability (Mackenzie, 2015).
The second main system is the MapReduce engine, whose algorithm filters, sorts and then
processes the database input in some way. It is therefore an important factor in managing the
overall data. In the workplace, a Hadoop adopter needs to be more sophisticated than a
relational database adopter. Hadoop deployment is also tricky, but vendors can create
applications that help resolve these issues (Barr and Stonebraker, 2015).
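The filter, sort and process steps of the MapReduce engine described above can be illustrated with a minimal single-process sketch. It is not Hadoop's own API: the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) and the word-counting task are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: filter out blank lines and emit a (word, 1) pair per word."""
    for line in records:
        if not line.strip():
            continue  # filtering step: skip empty input
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Group the emitted values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped}


def word_count(records):
    """Hypothetical driver that chains the three phases together."""
    return reduce_phase(shuffle_and_sort(map_phase(records)))
```

Calling word_count on a list of text lines returns a dictionary of word totals. In a real cluster the map and reduce phases would run on many machines rather than in one process.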
Hadoop has several advantages. It is a highly scalable storage platform because it stores and
distributes big data sets across hundreds of inexpensive servers that operate in parallel. It is
also a cost-effective method that companies can use to gather information in a flexible manner.
Finally, it is fast, so information can be gathered and processed quickly (Uzunkaya, Ensari and
Kavurucu, 2015).
How does Hadoop work?

Hadoop performs distributed processing of big data sets across a cluster of commodity servers,
working on many machines at once. To process data, the client submits the data and the program
to Hadoop; HDFS stores the data while MapReduce processes it. Because HDFS is the storage
element of Hadoop, two daemons run for HDFS: the NameNode runs on the master node and the
DataNode runs on the slave nodes. The NameNode stores the metadata and the DataNodes store
the actual data (Arora, 2015).
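The split between metadata on the NameNode and actual data on the DataNodes can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class and function names, the four-byte block size and the round-robin placement are hypothetical, and replication, which real HDFS relies on for reliability, is left out.

```python
class DataNode:
    """Stores the actual block contents (the data itself)."""

    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Stores only metadata: which blocks form a file and where they live."""

    def __init__(self, num_datanodes):
        self.num_datanodes = num_datanodes
        self.block_map = {}  # filename -> list of (block_id, datanode index)

    def allocate(self, filename, num_blocks):
        # Hypothetical round-robin placement; no replication in this sketch.
        locations = [(f"{filename}#{i}", i % self.num_datanodes)
                     for i in range(num_blocks)]
        self.block_map[filename] = locations
        return locations

    def locate(self, filename):
        return self.block_map[filename]


def put_file(namenode, datanodes, filename, data, block_size=4):
    """Client side: split the data into blocks and write each to its DataNode."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    locations = namenode.allocate(filename, len(chunks))
    for (block_id, node_index), chunk in zip(locations, chunks):
        datanodes[node_index].write(block_id, chunk)


def get_file(namenode, datanodes, filename):
    """Client side: ask the NameNode for locations, then read from DataNodes."""
    return b"".join(datanodes[node_index].read(block_id)
                    for block_id, node_index in namenode.locate(filename))
```

In this sketch the NameNode never touches file contents; it only answers the question of where each block lives, which mirrors the division of roles described above.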
Hadoop is also an ecosystem of libraries focused on the distributed processing of huge amounts
of information. It is basically a software library and framework that allows the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from individual servers to many machines, each of which offers
computation and storage (Padhy, Maharana and Parlakhemundi, 2016).
A Hadoop system consists of many machines that do not share any memory or disks. This means
that an organization can simply purchase a bundle of commodity servers and run the Hadoop
software on each of them. In a centralized database system, there is one large disk linked to
four, eight or 16 processors. In a Hadoop system, by contrast, there are many servers, each
with two or eight CPUs. An indexing job can be run by sending the code to many servers in the
cluster, and each server works on its own limited portion of the data. The results are then
brought together: the job is mapped out to all the servers and the results are reduced back
into a single result set (Qiao et al., 2015). A company should therefore consider this approach
for gathering data.
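The indexing job described above, where code is sent to each server and the partial results are reduced into one result set, can be sketched as follows. The shards stand in for the data held by individual servers and are processed in a simple loop rather than on real machines; the function names and the inverted-index task are assumptions for illustration only.

```python
from collections import defaultdict


def index_shard(shard):
    """Map step: index only the documents held by one (simulated) server."""
    partial = defaultdict(set)
    for doc_id, text in shard.items():
        for word in text.lower().split():
            partial[word].add(doc_id)
    return partial


def merge_indexes(partials):
    """Reduce step: merge the per-server indexes into a single result set."""
    merged = defaultdict(set)
    for partial in partials:
        for word, doc_ids in partial.items():
            merged[word] |= doc_ids
    return {word: sorted(doc_ids) for word, doc_ids in sorted(merged.items())}


def run_index_job(shards):
    """Hypothetical driver: run the same code on every shard, then reduce."""
    return merge_indexes(index_shard(shard) for shard in shards)
```

Each call to index_shard sees only its own shard, which corresponds to a server working on its limited portion of the data; only the merge step sees results from all of them.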

References
Arora, N. (2015). Hadoop: Components and Working. International Journal of Advanced
Research in Computer Science, 6(7), 85-90.
Barr, V., & Stonebraker, M. (2015). A valuable lesson, and whither Hadoop? Communications
of the ACM, 58(1), 18-19.
Mackenzie, A. (2015). The production of prediction: What does machine learning
want? European Journal of Cultural Studies, 18(4-5), 429-445.
Padhy, S. P., Maharana, S. B., & Parlakhemundi, G. (2016). Hadoop File Management
System. International Journal of Engineering and Management Research (IJEMR), 6(5),
281-286.
Qiao, L., Li, Y., Takiar, S., Liu, Z., Veeramreddy, N., Tu, M., ... & Botev, C. (2015). Gobblin:
Unifying data ingestion for Hadoop. Proceedings of the VLDB Endowment, 8(12), 1764-1769.
Uzunkaya, C., Ensari, T., & Kavurucu, Y. (2015). Hadoop ecosystem and its analysis on
tweets. Procedia-Social and Behavioral Sciences, 195, 1890-1897.