APACHE HADOOP VERSUS APACHE SPARK.

Added on 2023-01-18


Running head: APACHE HADOOP VERSUS APACHE SPARK 1
Apache Hadoop Versus Apache Spark
Name
Institutional Affiliation
Course
Date

Abstract
Big data has attracted enormous attention in the past few years. Analysing big data is a
basic requirement in the modern era, and that requirement becomes daunting when the data
sets are massive. It is very challenging to evaluate such large amounts of data and to
extract patterns and relevant information in a timely manner. This paper investigates the
concept of big data analysis and discusses two big data analytical tools: Apache Spark and
Apache Hadoop.
This paper proposes that Apache Spark should be used. In terms of performance,
Apache Spark offers higher processing speeds and near-real-time analytics. Apache Hadoop,
on the other hand, was designed for batch processing; it therefore does not support
real-time processing and is much slower than Apache Spark.
Secondly, Apache Spark is popular for its ease of use because it comes with user-friendly
APIs for Spark SQL, Python, Java, and Scala. Apache Spark also provides an
interactive mode that gives users and application developers an immediate response
to their queries and actions. Apache Hadoop, by contrast, has no interactive mode of its
own and relies on add-ons such as Pig and Hive. Apache Spark is also
compatible with Apache Hadoop and can use all the data sources that Hadoop uses.
However, because of its better performance, Apache Spark is still the preferred option.

Table of Contents
1 Introduction
2 Big Data
3 Big Data Analytics
4 Big Data Analytic Tools
5 Apache Hadoop
5.1 Evolvement of Apache Hadoop
5.2 Hadoop Ecosystem/Architecture
5.3 Components of Apache Hadoop
5.3.1 HDFS (Hadoop Distributed File System)
5.3.2 Hadoop MapReduce
5.3.3 Hadoop Common
5.3.4 Hadoop YARN
5.3.5 Other Hadoop Components
5.4 Hadoop Download
5.5 Types of Hadoop Installation
5.6 Major Commands of Hadoop
5.7 Hadoop Streaming
5.8 Reasons to Choose Apache Hadoop
5.9 Practical Applications of Apache Hadoop
5.10 Apache Hadoop Scope
6 Apache Spark
6.1 Apache Spark Ecosystem
6.2 Components of Apache Spark
6.2.1 Apache Spark Core
6.2.2 Apache Spark SQL
6.2.3 Apache Spark Streaming
6.2.4 Apache Spark MLlib
6.2.5 Apache Spark GraphX
6.2.6 Apache Spark R
6.2.7 Scalability Function
6.3 Running Spark Applications on a Cluster
6.4 Applications of Apache Spark
6.5 Practical Applications of Apache Spark
6.6 Reasons to Choose Spark
7 Comparison between Apache Hadoop and Apache Spark
7.1 The Market Situation
7.2 The Main Difference Between Apache Hadoop and Apache Spark
7.3 Examples of Practical Applications
8 Conclusion
9 References

1 Introduction
In the current computer age, people increasingly depend on technological
devices, and almost all aspects of human life, including the social, personal and
professional, are permeated by technology. Almost all of those aspects deal with some kind
of data (Hartung, 2018). The sharp increase in data complexity caused by the rapid
growth in variety and velocity has created new challenges in data
management, and from these the term Big Data has emerged. Analysing, storing, assessing and
securing big data are among the most discussed topics in the current technological world
(Hussain & Roy, 2016). Big data analysis is a method of collecting data from various
sources, arranging that information in a meaningful way, and then evaluating those large
data sets to uncover important figures and facts. This analysis helps to
identify hidden figures and facts in the data, and to rank or categorize the
information according to its importance (Hoskins, 2014). In summary, big data analysis
is the process of acquiring knowledge from a massive variety of data. Organizations such as
Twitter process about 10 thousand tweets per second before broadcasting them to people.
They evaluate all data at a very fast rate to make sure that each tweet complies with the
set policy and that prohibited words are removed from the tweets. The evaluation must be
carried out in real time to ensure that there are no delays in broadcasting tweets live to
the public (Kirkpatrick, 2013). Similarly, enterprises such as Forex trading firms evaluate
social information to forecast future public trends. To evaluate such large volumes of
data, it is necessary to use analytical tools. This paper concentrates on Apache Hadoop and
Apache Spark. It begins with a literature review that explores the general view of big data
and big data analytics, and then discusses the two leading big data analytical tools:
Apache Spark and Apache Hadoop.

2 Big Data
The availability and exponential growth of massive amounts of information of
different kinds is referred to as Big Data (Hoskins, 2014). Big Data is a term that is
popularly used in the current automated world and is perceived to be as important to
society and business as the internet. It is widely believed that more data
results in more precise assessments, which in turn lead to more timely, legitimate and
confident decision making (Bettencourt, 2014). Better decisions and judgement result in
reduced risk, higher operational efficiency, and lower cost. Researchers
characterize big data along the following dimensions:
Volume-wise: this is a significant factor that has led to the emergence of big data.
Volume is increasing owing to several factors. Governments and companies have been
recording transactional data for years. Social media, automation, sensors and
machine-to-machine communication constantly generate data, much of it unstructured
(Saeed, 2018). Previously, storing data was itself a problem; however, the emergence of
affordable and advanced storage devices has helped to address it (Bughin, 2016).
Nevertheless, volume still causes other problems, such as identifying what is significant
within huge data volumes and extracting important information by analysing the data.
Velocity-wise: the rate at which data volume is increasing is becoming critical, and it is
challenging to handle it efficiently and in time. The need to manage large amounts
of information in real time is driven by the rise of RFID (radio-frequency identification)
tags, robotics, sensors and automation, internet streaming, and other technologies
(Catlett & Ghani, 2015). As such, the increase in data velocity is among the biggest
challenges facing large companies today.
Variety-wise: although the growth in data volume is a huge challenge, data
variety is a bigger problem. Data now arrives in many forms: structured and unstructured,
relational and non-relational, in different formats and file systems, and as images,
videos, multimedia, financial data, scientific data, aviation data, etc. (Dhar, 2014).
The issue is finding ways to correlate the various types of data in time to obtain value
from them. Currently, many companies are trying hard to find better solutions to this issue.
Variability-wise: the inconsistent flow of big data is a big challenge.
Social media reactions to events across the globe drive large volumes of information, which
require timely assessment before the trend changes (Diesner, 2015). Events across the world
also have an impact on financial markets, and operating costs increase further when
handling unstructured data.
Complexity-wise: huge volumes of data, inconsistent trends, and the increasing variety of
data make big data very challenging. In spite of all the above, big data must be sorted
out to correlate, connect and develop useful relational linkages and hierarchies in time,
before the information becomes difficult to control (Dumbill, 2013). This illustrates the
complexity involved in today’s big data. In short, any data repository with the following
features can be referred to as a big data repository:
• Central planning and management
• Extensible: primary capabilities can be altered and augmented
• Manages huge amounts of data (Zeide, 2017)
• Low cost
• Offers capabilities for processing data
• Accessibility: highly available, open-source or commercial product with
excellent usability (Hare, 2014)
• Distributed, redundant data storage
• Very fast data insertion
• Hardware agnostic
• Parallel processing of tasks
3 Big Data Analytics
Big data analytics is the practice of running analytical algorithms on
powerful supporting platforms to reveal the potential hidden in big data, such as unknown
patterns or hidden correlations (Tromp, Pechenizkiy & Gaber, 2017). Based on the time
required to process the data, big data analytics can be grouped into two different paradigms.
• Batch processing: here, information is first stored and then analysed. The
leading model for batch processing is MapReduce. The basic concept of MapReduce
is that information is first split into small portions (Cercone, F'IEEE, 2015). These
portions are then processed in a distributed and parallel way to create intermediate
outcomes. The end result is obtained by combining all the intermediate outcomes (Al
Jabri, Al-Badi & Ali, 2017). MapReduce schedules computation resources close to
the location of the data, which avoids the communication cost of
transmitting data. The model is easy to use and is extensively used in web mining,
bioinformatics and machine learning.
• Streaming processing: this paradigm starts from the assumption that the value of data
relies on its freshness. Therefore, the streaming processing model evaluates
information as soon as possible to obtain its outcome. In this model, information
arrives as a stream. Because the stream arrives continuously, is fast and carries a large
volume, only a small portion of the stream is kept in limited memory
(Batarseh, Yang & Deng, 2017). A few passes over the stream are used to attain
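
The two paradigms described above can be illustrated with a minimal, single-machine sketch in plain Python. It is not Hadoop or Spark code, and every function name in it is hypothetical. The batch part splits stored records into portions, computes an intermediate word count per portion, and combines the intermediate results, following the split, process and combine steps of MapReduce. The streaming part keeps only a fixed-size window of the most recent values in memory. The sliding-window average is an assumed example of a bounded-memory computation and is not taken from the cited sources.

```python
from collections import Counter, deque
from functools import reduce
from typing import Iterable, Iterator, List


def split_into_chunks(records: List[str], chunk_size: int) -> List[List[str]]:
    """Split the stored input into small portions."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk: List[str]) -> Counter:
    """Map step: build an intermediate word count for one portion."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two intermediate results into one."""
    return left + right


def batch_word_count(records: List[str], chunk_size: int = 2) -> Counter:
    """Batch paradigm: the data is stored first, then processed as a whole."""
    chunks = split_into_chunks(records, chunk_size)
    # On a real cluster each portion would be processed in parallel,
    # close to where it is stored; here the portions are processed in turn.
    intermediate = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, intermediate, Counter())


def streaming_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Streaming paradigm: keep only the last `window` values in memory
    and emit an up-to-date result as each new value arrives."""
    recent: deque = deque(maxlen=window)  # older values are discarded automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


if __name__ == "__main__":
    docs = ["spark is fast", "hadoop is batch", "spark streams"]
    print(batch_word_count(docs))
    print(list(streaming_average([1, 2, 3, 4], window=2)))
```

The batch function cannot return anything until all portions have been processed, whereas the streaming generator yields a result after every incoming value while holding at most `window` values in memory. This is the practical difference between the two paradigms.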
