Running head: APACHE HADOOP VERSUS APACHE SPARK
Apache Hadoop Versus Apache Spark
Name
Institutional Affiliation
Course
Date
Abstract
Big data has attracted huge attention in the past few years. Evaluating big data is a
basic requirement in the modern era, yet such evaluation is daunting when the data sets
involved are massive. It is very challenging to analyse huge amounts of data to extract
patterns and relevant information in a timely manner. This paper investigates the concept of
Big Data analysis and discusses two Big Data analytical tools: Apache Spark and Apache
Hadoop.
This paper proposes that Apache Spark should be used. In terms of performance,
Apache Spark offers higher processing speeds and near-real-time analytics. Apache Hadoop,
on the other hand, was designed for batch processing; it does not support real-time
processing and is much slower than Apache Spark. Secondly, Apache Spark is popular for its
ease of use because it comes with user-friendly APIs built for Spark SQL, Python, Java, and
Scala. Apache Spark also provides an interactive mode that gives users and application
developers immediate feedback on queries and actions. Apache Hadoop, by contrast, has no
interactive elements and only supports add-ons such as Pig and Hive. Apache Spark is also
compatible with Apache Hadoop and shares all the data sources that Hadoop uses.
Because of its better performance, Apache Spark is the preferred option.
Table of Contents
1 Introduction
2 Big Data
3 Big Data Analytics
4 Big Data Analytic Tools
5 Apache Hadoop
5.1 Evolvement of Apache Hadoop
5.2 Hadoop Ecosystem/Architecture
5.3 Components of Apache Hadoop
5.3.1 HDFS (Hadoop Distributed File System)
5.3.2 Hadoop MapReduce
5.3.3 Hadoop Common
5.3.4 Hadoop YARN
5.3.5 Other Hadoop Components
5.4 Hadoop Download
5.5 Types of Hadoop Installation
5.6 Major Commands of Hadoop
5.7 Hadoop Streaming
5.8 Reasons to Choose Apache Hadoop
5.9 Practical Applications of Apache Hadoop
5.10 Apache Hadoop Scope
6 Apache Spark
6.1 Apache Spark Ecosystem
6.2 Components of Apache Spark
6.2.1 Apache Spark Core
6.2.2 Apache Spark SQL
6.2.3 Apache Spark Streaming
6.2.4 Apache Spark MLlib
6.2.5 Apache Spark GraphX
6.2.6 Apache Spark R
6.2.7 Scalability Function
6.3 Running Spark Applications on a Cluster
6.4 Applications of Apache Spark
6.5 Practical Applications of Apache Spark
6.6 Reasons to Choose Spark
7 Comparison between Apache Hadoop and Apache Spark
7.1 The Market Situation
7.2 The Main Difference between Apache Hadoop and Apache Spark
7.3 Examples of Practical Applications
8 Conclusion
9 References
1 Introduction
In the current computer age, people increasingly depend on technological devices, and
almost all aspects of human life, including the social, personal and professional, are
permeated by technology. Almost all of those aspects deal with some kind of data
(Hartung, 2018). As a result of the huge increase in data complexity caused by the rapid
growth in data variety and velocity, new challenges have emerged in the field of data
management, hence the emergence of the term Big Data. Analysing, storing, assessing and
securing big data are among the popular topics in the current technological world (Hussain &
Roy, 2016). Big data analysis is a method of collecting data from various sources, arranging
that information in a meaningful way, and then evaluating those big data sets to uncover
important figures and facts from the collection. This analysis assists in identifying hidden
figures and facts in the data, as well as ranking or categorizing the information by the value
it offers (Hoskins, 2014). In summary, big data analysis is the process of acquiring
knowledge from a massive variety of data. Organizations such as Twitter process about ten
thousand tweets per second before broadcasting them to people. They evaluate all this data at
a very fast rate to make sure each tweet complies with the set policy and that prohibited
words are removed from the tweets. The evaluation must be carried out in real time to ensure
that there are no delays in broadcasting tweets live to the public (Kirkpatrick, 2013).
Similarly, enterprises such as forex-trading firms evaluate social information to forecast
future public trends. Evaluating such large data requires analytical tools. This paper
concentrates on Apache Hadoop and Apache Spark. The sections of this paper include a
literature review that explores the general view of big data and big data analytics, followed
by a discussion of the two leading big data analytical tools: Apache Spark and Apache
Hadoop.
2 Big Data
The availability and exponential growth of massive amounts of information in
different varieties is referred to as Big Data (Hoskins, 2014). Big Data is a term that is
popularly used in the current automated world and is perceived to be as important to society
and business as the internet. It is extensively proven and believed that more data results in
more precise assessments, which in turn lead to more timely, legitimate and confident
decision making (Bettencourt, 2014). Better decisions and judgement result in reduced risk,
higher operational efficiency, and cost reduction. Researchers of Big Data picture big data
along the following dimensions:
Volume-wise: volume is a significant factor that has led to the emergence of big data,
and it keeps increasing due to several factors. Governments and companies have been
documenting transactional data for years, while social media, automation, machine-to-machine
communication and sensors are consistently generating unstructured data (Saeed, 2018).
Previously, the storage of data was a problem; however, the emergence of affordable and
advanced storage devices has helped address the issue of data storage (Bughin, 2016).
Nevertheless, volume still causes other problems, such as identifying what is significant
within huge data volumes and gathering important information by analysing the data.
Velocity-wise: the rate at which data volume is increasing is becoming critical, and it is
challenging to address the issue efficiently and in time. The need to manage large pieces
of information in real time is driven by the rise of RFID (radio-frequency identification)
tags, robotics, sensors and automation, internet streaming, and other technology facilities
(Catlett & Ghani, 2015). As such, the increase in data velocity is among the biggest
challenges experienced by every big company today.
Variety-wise: although the growth in data volume is a huge challenge, data
variety is a bigger problem. Information is growing in different varieties: different
formats, unstructured data, various file systems, images, financial data, scientific data,
structured, relational and non-relational data, videos, multimedia, aviation data, etc. (Dhar, 2014).
The issue is finding ways to correlate the various types of data in time to obtain value from
this data. Currently, many companies are trying hard to find better solutions to this issue.
Variability-wise: the inconsistent flow of big data is a big challenge.
Social media reactions to events across the globe drive large volumes of information, which
require timely assessment before the trend changes (Diesner, 2015). Events across the world
have an impact on financial markets, and operating costs increase further when handling
unstructured data.
Complexity-wise: huge volumes of data, inconsistent trends, and the increasing variety of
data make big data very challenging. In spite of all the above, big data must be sorted
out to correlate, connect and develop useful relational linkages and hierarchies in time before
the information becomes too difficult to control (Dumbill, 2013). This illustrates the complexity
involved in today's big data. In short, any data repository with the following features
can be referred to as big data:
Central planning and management
Extensible: primary capabilities can be altered and augmented
Manages huge amounts of data (Zeide, 2017)
Less costly
Offers capabilities for processing data
Accessibility: a highly available open-source or commercial product with
excellent usability (Hare, 2014)
Distributed, redundant data storage
Very fast data insertion
Hardware agnostic
Parallel processing of tasks
3 Big Data Analytics
Big data analytics is the practice of employing analytical algorithms, running on
powerful supporting platforms, to reveal the potential hidden in big data, such as unknown patterns or
hidden correlations (Tromp, Pechenizkiy & Gaber, 2017). Based on the time required to
process the data, big data analytics can be grouped into two different paradigms.
Batch processing: here, information is first stored and then analysed. The
leading model for batch processing is MapReduce. The basic concept of MapReduce
is that information is first split into small portions (Cercone, 2015). These
portions are then processed in a distributed and parallel way to produce intermediate
outcomes, and the end result is acquired by combining all the intermediate outcomes (Al
Jabri, Al-Badi & Ali, 2017). MapReduce organizes computation resources near
the location of the data, which avoids the communication cost of
transmitting data. The model is easy to use and is extensively used in web mining,
bioinformatics and machine learning; a small sketch of the idea follows this list.
Streaming processing: this paradigm assumes that the value of data depends
on its freshness. The streaming processing model therefore evaluates
information in a timely manner to obtain its outcome. In this model, information is
acquired as a stream. Because the stream is fast and carries a large volume of data,
only a small section of the stream can be kept in the limited memory available
(Batarseh, Yang & Deng, 2017). One or a few passes over the stream are used to attain
approximate results. Streaming processing technology and theory have been
studied for years. The streaming processing model is employed for online
applications, usually at the millisecond or second level (Bornakke & Due, 2018).
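To make the batch paradigm concrete, the following minimal Python sketch simulates the MapReduce flow described above: input is split into portions, each portion is mapped to intermediate key-value pairs, and the intermediate outcomes are combined in a reduce step. This is a toy, in-memory illustration; the word-count task and all names are hypothetical and not tied to any particular framework.

from collections import defaultdict

def map_phase(portion):
    # Emit an intermediate (word, 1) pair for every word in this portion.
    return [(word, 1) for word in portion.split()]

def reduce_phase(pairs):
    # Combine all intermediate pairs that share the same key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

portions = ["big data needs tools", "big tools for big data"]      # split input
intermediate = [p for part in portions for p in map_phase(part)]   # map step
print(reduce_phase(intermediate))                                  # combined result

In a real MapReduce job, the map calls would run in parallel on the nodes holding each portion, and the framework would group the intermediate pairs by key before the reduce step.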
4 Big Data Analytic Tools
There are many big data tools for data evaluation today. However, only two tools will
be discussed in this paper: Apache Spark and Apache Hadoop.
5 Apache Hadoop
Apache Hadoop is an open-source data framework, or platform, built in Java and devoted
to storing and analysing huge amounts of unstructured data (E. Laxmi Lydia & Srinivasa Rao,
2018). Digital mediums are transmitting large amounts of data, and new big data technologies
are emerging at a fast rate; nevertheless, Apache Hadoop was among the first tools to
be created. It enables several simultaneous tasks to execute across one to many servers
without delay (Kim & Lee, 2014). It comprises a distributed file system that permits the
transmission of files and data between different nodes in a split second. Besides, Apache
Hadoop has the ability to keep processing effectively even in cases of node failure
(Matacuta & Popa, 2018).
5.1 Evolvement of Apache Hadoop
Scientists Mike Cafarella and Doug Cutting created the Hadoop platform and
introduced it in 2006 to support distribution for the Nutch search engine. It was inspired by
Google's MapReduce, which divides an application into small sections to execute on
different nodes (Mach-Król & Modrzejewska, 2017). The Apache Software Foundation
made the tool publicly available in 2012. The name of the tool came from Doug Cutting's
son's yellow soft toy elephant. As the platform matured, an improved second version,
Hadoop 2.3.0, was launched on 20 February 2014 with major adjustments in the
architecture.
5.2 Hadoop Ecosystem/Architecture
The Hadoop ecosystem is a framework, or platform, which assists in addressing the issues of
big data. It consists of various components and services, such as storing, maintaining,
ingesting and analysing data (Landset, Khoshgoftaar, Richter & Hasanin, 2015). The structural
design can be divided into two parts, that is, the Hadoop core and other, complementary
components.
5.3 Components of Apache Hadoop
The figure below illustrates the key components of Apache Hadoop.
Figure 1: Key Components of Apache Hadoop (Source: Landset, Khoshgoftaar, Richter &
Hasanin, 2015)
5.3.1 HDFS (Hadoop Distributed File System)
HDFS stores information in small blocks and distributes them across the cluster
(Naidu, 2018). Each block is duplicated several times to make sure the data remains available.
HDFS is the most essential element of the Hadoop ecosystem. It is the basic Hadoop storage
system. It is developed in Java and offers fault-tolerant, cost-efficient, scalable and
dependable data storage for big data (Mavridis & Karatza, 2017). HDFS runs on commodity
hardware and ships with a default configuration suitable for many installations; further
configuration is required only for very large data sets. Hadoop uses shell-like commands to
communicate directly with HDFS.
Components of HDFS
There are two main Hadoop HDFS components: the DataNode and the NameNode.
NameNode: it is also referred to as the Master node. It does not store the
dataset or actual data. Instead, it stores metadata: on which DataNode the information
is kept, the number of blocks, on which rack, among other details. It comprises
directories and files. The work of the HDFS NameNode is to manage the namespace of
the filesystem, control users' access to documents and run file system operations such
as closing, naming and opening files and directories (Aung & Thein, 2014).
DataNode: it is also referred to as the Slave. Its work is to store the actual data
in HDFS. The DataNode carries out the read and write operations requested by users.
A DataNode block replica comprises two documents on the file system: the first
holds the data and the second records the block's metadata, which includes the data
checksums. In the initial set-up, each DataNode connects to its matching NameNode
and begins the interaction. Validation of the DataNode software version and namespace
ID happens during this first interaction; if a mismatch is discovered, the DataNode
automatically goes down. The HDFS DataNode is responsible for operations such as
block replica deletion, creation and duplication according to the NameNode's
instructions (Hussain & T, 2016). Another task carried out by the DataNode is to
monitor the data storage of the system.
5.3.2 Hadoop MapReduce
It processes tasks in parallel by distributing them as small blocks. Hadoop
MapReduce provides data processing. It is a software framework for easily writing programs
that process the huge amounts of unstructured and structured data kept in HDFS (Singh &
Reddy, 2014). MapReduce programs are parallel in nature and are therefore very helpful for
analysing huge amounts of data using several machines in the cluster. As such, it
enhances the speed and reliability of the cluster.
Figure 2: Hadoop MapReduce (Source: Greeshma & Pradeepini, 2016)
Hadoop MapReduce works by dividing the processing into two stages: the Map phase
and the Reduce phase (Greeshma & Pradeepini, 2016). Each phase uses key-value pairs as
input and output, and the programmer specifies two functions: the map function and the
reduce function.
The map function takes a data set and transforms it into another data set in which
individual elements are broken down into key/value pairs (tuples). The reduce function, on the
other hand, takes the map output as its input and combines those data tuples by key,
adjusting the value appropriately (Lathar & Srinivasa, 2019). The following are some of the
features of MapReduce.
Simplicity: MapReduce workloads are easy to execute. Programs can
be written in any language, such as C++, Java and Python (Glybovets & Dmytruk,
2016).
Speed: through parallel processing, MapReduce addresses in minutes or hours
problems that would otherwise take more than a day to solve.
Scalability: MapReduce can manage very large amounts of data.
Fault tolerance: MapReduce handles failures. If a single copy of the data is
not available, another device holds a copy of the same key pair that can be utilized
for the same work.
5.3.3 Hadoop Common
It is a group of common libraries and utilities that support the other modules of
Hadoop. It ensures that hardware failures are automatically handled by the Hadoop
cluster. The Hadoop Common component is perceived as the core, or base, of the framework,
as it offers essential services and underlying processes such as the abstraction of the
underlying operating system and its file system. In addition, Hadoop Common contains the
JAR (Java Archive) files and scripts needed to start Hadoop. Besides, the component provides
documentation and source code, and a contribution section that incorporates various projects
from the Hadoop community.
5.3.4 Hadoop YARN
It assigns resources, which permits various users to run applications without worrying
about the increased amount of work. YARN is also referred to as the Hadoop
operating system, as it is accountable for monitoring and managing workloads (Huang, Meng,
Zhang & Zhang, 2017). It enables several data processing engines, such as batch processing
and real-time processing, to manage data kept on one platform.
Figure 3: Hadoop Yarn Diagram (Source: Huang, Meng, Zhang & Zhang, 2017)
YARN is the data processing layer of Hadoop 2. Its major features include:
Flexibility: it enables other data processing models, such as streaming
and interactive processing. As such, programs other than MapReduce can also be
executed in Hadoop 2.
Shared: it offers a dependable, shared, stable and secure foundation of operational
services across several workloads. Additional programming paradigms, such as iterative
modelling and graph processing, thereby become possible for data processing.
Efficiency: since multiple programs execute on the same cluster,
Hadoop efficiency increases without affecting the quality of service.
5.3.5 Other Hadoop Components
Ambari
This is a web-based interface for configuring, managing and monitoring big data
clusters and their elements, such as MapReduce, HCatalog, ZooKeeper, Pig, HDFS, Hive,
HBase, Oozie and Sqoop. It offers support for managing the cluster's health and permits
evaluating the performance of specific elements, such as Pig, MapReduce and Hive,
in a user-friendly manner. Management of Hadoop becomes easier since Ambari offers a
secure and consistent platform for operational control.
Figure 4: Ambari (Source: Landset, Khoshgoftaar, Richter & Hasanin, 2015)
The following are some of the features of Ambari:
Centralized security system: Ambari reduces the complexity of
configuring and administering cluster security across the whole platform.
Complete visibility into cluster health: Ambari makes sure
that the cluster is available and healthy with a complete approach to monitoring.
Simplified configuration, installation and management: Ambari
efficiently and easily creates and monitors clusters at scale.
Highly customizable and extensible: Ambari is flexible enough to bring
custom services under its management.
Cassandra
It is a highly scalable, open-source, distributed NoSQL database system committed
to managing huge amounts of data across several commodity servers, with the aim of
supporting high availability without a single point of failure (Fan, Ramaraju, McKenzie,
Golab & Wong, 2015).
Flume
It is a reliable, distributed component for efficiently collecting, aggregating and
moving large streaming data into HDFS. Flume is a reliable and fault-tolerant tool. This
component of the Hadoop ecosystem permits the flow of data from its origin into the Hadoop
ecosystem. It uses a simple, extensible data model that accommodates online analytical
applications (Ranawade, Navale, Dhamal, Deshpande & Ghuge, 2017). Flume assists in the
immediate acquisition of data from several servers into Hadoop.
Figure 5: Flume (Source: Ranawade, Navale, Dhamal, Deshpande & Ghuge, 2017)
HBase
It is a non-relational, distributed database running on top of the Hadoop cluster that
stores huge amounts of structured data in tables that can consist of billions of rows and
millions of columns. HBase can serve as an input for MapReduce workloads. It is a
distributed, scalable NoSQL database built on top of HDFS. HBase offers real-time read and
write access to information in HDFS (Mavani, 2013).
Figure 6: HBase (Source: Mavani, 2013)
There are two components of HBase: the RegionServer and the HBase Master.
HBase Master: it takes no part in actual data storage, but it
coordinates load balancing across RegionServers. It monitors and manages the
Hadoop cluster. Besides, it carries out administration (it has an interface for creating,
updating and deleting tables). The HBase Master also manages failover and controls
DDL operations (Xu & Liang, 2013).
RegionServer: it is the worker node which handles read, write, update
and delete requests from clients. The RegionServer process runs on every
Hadoop cluster node, executing on the HDFS DataNode. A brief client-side sketch follows.
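To illustrate how a client reads and writes HBase rows, the following minimal Python sketch uses the third-party happybase library, which talks to HBase through its Thrift interface; this library is an assumption, not part of HBase itself, and the host, table name and column family are hypothetical (the table is assumed to already exist).

import happybase

# Connect to a hypothetical HBase Thrift server.
connection = happybase.Connection('localhost', port=9090)
table = connection.table('users')  # hypothetical, pre-created table

# Write one cell, then read the whole row back.
table.put(b'row1', {b'info:name': b'Alice'})
print(table.row(b'row1'))  # {b'info:name': b'Alice'}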
Solr
It is a highly scalable search component which enables centralized configuration,
recovery, indexing and failover. Programs developed with Solr are high level and
offer high performance (Chen, Xu & Zhu, 2014). Solr assists in finding the needed data within
huge amounts of data. Besides, it can also be used for storage purposes. The following are
some of the features of Solr.
Restful APIs: to interact with Solr, one does not need Java
programming skills; rather, one can use its RESTful services (Vis,
2013). Documents are submitted to Solr in file formats such as JSON, XML and CSV,
and results are acquired in the same formats (a query sketch follows this list).
Enterprise ready: depending on the requirements of the company, Solr
can be deployed in many types of systems, small or big, such as standalone,
distributed and cloud deployments.
NoSQL database: Solr can distribute search tasks across a
cluster (Yan, Liu & Lao, 2014).
Highly scalable: Solr's capacity can be scaled by adding replicas.
Full text search: Solr offers all the capabilities required for full text search,
such as phrases, wildcards, tokens, spell check and auto-complete.
Extensible and flexible: by extending the Java classes and configuring them
appropriately, the Solr components can be customized easily (Cassales, Charão,
Pinheiro, Souveyet & Steffenel, 2015).
Admin interface: Solr offers a user-friendly, feature-rich
interface that assists in carrying out all possible tasks, such as adding,
updating, deleting and searching documents and managing logs.
Text-centric and ordered by relevance: Solr is regularly used to
search text documents, and the results are returned ordered by relevance to the
user's query.
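The following minimal Python sketch issues a search through Solr's RESTful select API using the requests library; the port shown is Solr's usual default, while the core name and query field are hypothetical.

import requests

# Query a hypothetical Solr core named 'docs' over its RESTful select API.
resp = requests.get(
    'http://localhost:8983/solr/docs/select',
    params={'q': 'title:hadoop', 'wt': 'json'},
)
docs = resp.json()['response']['docs']  # matching documents, returned as JSON
print(len(docs))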
Hadoop Sqoop
Sqoop is a tool that transfers large amounts of data between Hadoop and structured
databases. It brings data in from external sources into connected components of the Hadoop
ecosystem, such as HDFS, Hive or HBase. Besides, it transfers information from Hadoop to
other external destinations (Chen & Jiang, 2015). Sqoop is used with relational databases
such as Netezza, MySQL, Teradata and Oracle.
Figure 7: Apache Sqoop (Source: Chen & Jiang, 2015)
Some of the features of Sqoop include:
Brings in sequential data sets from the mainframe: Sqoop meets the
growing demand to transfer data from the mainframe to HDFS (Chen, Ko & Yeo,
2015).
Transfers directly to ORC files: this enhances lightweight indexing,
compression and query performance.
Parallel data transfer: this is important for optimal system usage and
faster performance (Lehrack, Duckeck & Ebke, 2014).
Effective data analysis: it enhances the effectiveness of data evaluation by
merging structured and unstructured data in a schema-on-read data
lake.
Fast data copies: fast data copies are made from an external source
into Hadoop.
Hadoop Zookeeper
An open-source system that coordinates and synchronizes distributed systems. ZooKeeper
is used for naming, offering group services, maintaining configuration data and providing
distributed synchronization (Okorafor, 2012). It coordinates and manages a huge cluster of
devices.
Figure 8: Hadoop Zookeeper (Source: Okorafor, 2012)
Some of the features of ZooKeeper include:
Fast: ZooKeeper demonstrates high speed in workloads where data reads
outnumber writes; the typical read/write ratio is 10:1.
Ordered: ZooKeeper maintains a record of all transactions.
Hcatalog
HCatalog is a storage management layer which enables developers to share and
access data. It is the table and storage management tier of Hadoop. HCatalog supports various
elements of the Hadoop ecosystem, such as Hive, MapReduce and Pig, so they can easily
read and write data in the cluster. It is a core element of Hive that allows users to keep their
information in any structure and format. By default, HCatalog supports the CSV, SequenceFile,
RCFile, JSON and ORC file formats. HCatalog is associated with several benefits, including:
It provides notifications of data availability
It relieves users of overheads related to data storage
It offers visibility for data cleaning and archiving tools
Hadoop Hive
Hive is a data warehouse infrastructure that performs three major functions: data
summarization, query and analysis (Agarwal, 2018). Hive uses a language referred to as
HiveQL (HQL), a query language similar to SQL. HiveQL translates SQL-like queries into
MapReduce jobs that run on Hadoop; a small client sketch follows the list of components below.
Figure 9: Apache Hadoop Hive (Source: Agarwal, 2018)
The major parts of Hive are:
Metastore: stores the metadata.
Query compiler: compiles HiveQL into a DAG (Directed Acyclic Graph).
Driver: controls the lifecycle of a HiveQL statement.
Hive server: provides a Thrift server.
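As a brief illustration of submitting HiveQL from a client, the sketch below uses the third-party PyHive library; this client is an assumption (Hive itself also ships with command-line shells), and the host, table and query are hypothetical.

from pyhive import hive

# Connect to a hypothetical HiveServer2 instance.
cursor = hive.connect(host='localhost', port=10000).cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce jobs.
cursor.execute('SELECT page, COUNT(*) FROM visits GROUP BY page')
for page, count in cursor.fetchall():
    print(page, count)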
Hadoop Oozie
Hadoop Oozie is a server-based application that schedules and manages Hadoop
workloads. It merges several tasks sequentially into a single logical unit of work. The
framework of the component is fully integrated with YARN and supports Hadoop jobs
for Pig, Sqoop, MapReduce and Hive.
Figure 10: Hadoop Oozie (Source: Agarwal, 2018)
Users can develop DAG workflows, which can execute sequentially or in parallel in
Hadoop. Oozie is scalable and handles the timely execution of multiple workflows in a Hadoop
cluster. Besides, Oozie is very flexible: users can easily start, stop, suspend and rerun tasks.
Users can also skip a particular node or re-execute it in Oozie. There are two types of Oozie
jobs:
Oozie workflow: it stores and runs workflows of Hadoop jobs such as
Pig, MapReduce and Hive.
Oozie coordinator: it runs workflow jobs based on data
availability and predefined schedules.
Hadoop Pig
Hadoop Pig is a dedicated high-level tool which is responsible for manipulating the
information kept in HDFS (Barskar & Phulre, 2017). It is supported by a MapReduce
compiler and a language referred to as Pig Latin. It permits specialists to ETL (extract,
transform and load) the information without writing MapReduce code. Hadoop Pig loads the
information, applies the needed filters and stores the information in the needed format. Pig
needs a Java runtime environment to run programs (rna C & Ansari, 2017).
Figure 11: Hadoop Pig (Source: Barskar & Phulre, 2017)
The features of Pig include:
Expandability: users can develop their own functions for
special-purpose processing.
Manages all types of data: Pig analyses both unstructured and
structured data.
Optimization opportunities: the system can optimize operations
automatically, which helps the user concentrate on semantics rather than efficiency.
Avro
Avro is a component of the Hadoop ecosystem well known for data serialization. It
is an open-source system that offers data exchange and data serialization services for Hadoop
(Plase, Niedrite & Taranovs, 2017). These services can be used independently or together.
With the help of Avro, big data applications written in various languages can exchange data.
Data can be organized into messages or files using its serialization services. Avro stores
information in a single file or message, allowing programs to easily understand data kept in
an Avro message or file.
Avro schema: Avro depends on schemas for serialization and deserialization. A schema
is required by Avro for every data read or write. Avro information is stored together with its
schema in a document; as such, documents can be processed at any time by any program
(a minimal schema example follows the feature list below).
Dynamic typing: this means serialization and deserialization without generating code.
Code generation remains available in Avro for statically typed languages, as an optional
optimization.
The following are some of the features provided by Avro:
It supports remote procedure calls (RPC)
It has a container file to store persistent data
It has a rich data structure
It has a fast, compact, binary data format
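A minimal example of an Avro schema, which is itself written in JSON; the record and field names here are hypothetical.

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}

Data serialized against this schema is stored together with the schema, which is what allows any program to read the file later, as described above.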
Thrift
Thrift is a software framework for scalable cross-language service development. It
is an interface definition framework for remote procedure call (RPC) communication.
Hadoop performs many RPC calls, which makes Apache Thrift useful for execution and
other operations.
Figure 12: Thrift (Source: Barskar & Phulre, 2017)
Apache Drill
The main goal of Apache Drill is to process large amounts of data, including semi-
structured and structured data. It is a low-latency distributed query engine designed to
query huge amounts of data and scale across multiple nodes (Hausenblas & Nadeau, 2013).
Drill is the first query engine with a schema-free model. Drill is a helpful component at
Cardlytics, an organization that offers end-user purchase information for internet and mobile
banking; Cardlytics uses Drill to quickly execute queries and process huge numbers of
records. Drill has a unique memory management system that enhances memory usage and
allocation and eliminates garbage collection. Drill works with Hive by enabling developers
to reuse their existing Hive implementations. Some of the features of Apache Drill include:
Expandability: Drill offers an extensible design at all layers, including
the query layer, query optimization and the client API. Any layer can be extended for
the unique needs of a company.
Discovery of dynamic schema: Apache Drill does not require type
specifications or schemas for data in order to execute a query. Rather, Drill processes
the data in units referred to as record batches and discovers the schema during
processing.
Flexibility: Drill offers a hierarchical columnar data model that can
represent highly flexible and complex data and permit efficient processing.
Decentralized metadata: Drill is different from other SQL-on-Hadoop
technologies in that it has no centralized metadata requirement. Drill users do not
need to create and manage tables in a metadata store in order to query data.
Apache Mahout
Mahout is an open-source engine providing a library of scalable data mining and
machine learning algorithms. Once data has been stored in Hadoop HDFS, Mahout offers
the data science tools to identify important patterns in those big data sets. Mahout's
algorithms include:
Clustering: items from a specific class are organized into naturally
occurring groups, such that items belonging to the same group are similar to one
another.
Collaborative filtering: it learns from user behaviour and
makes product recommendations, for instance Amazon recommendations.
Classification: it learns from existing categorizations and then assigns
uncategorized items to the best category.
Frequent pattern mining: it evaluates groups of items, such as terms in
a query session or items in a shopping cart, and identifies which items typically
appear together.
All the above components of the Hadoop ecosystem strengthen the functionality of
Hadoop.
5.4 Hadoop Download
To work in the Hadoop environment, the first step is to download Hadoop. This
can be done on any device at no cost, since the framework is available as an
open-source tool. However, there are specific system requirements that need to be met for
the installation to be successful. They include:
Hardware requirements: Hadoop can operate on any common hardware
cluster; all that is required is some commodity hardware.
Operating system (OS) requirements: Hadoop can operate on Unix and
Windows platforms; Linux is the platform most commonly used for production
requirements.
Browser requirements: Hadoop's web interfaces support most of the popular
browsers, such as Microsoft Internet Explorer, Google Chrome, Mozilla Firefox and
Safari, on Windows, Linux and Macintosh systems as needed.
Software requirements: because the Hadoop framework is written in Java,
the software requirement is Java; Java 1.6 is the minimum version (Cao, Wu, Liu &
Zhang, 2010).
Database requirements: components of the Hadoop ecosystem, such as
HCatalog and Hive, use a MySQL database to run the Hadoop framework efficiently.
The newest version can be used, or Apache Ambari's wizard can be allowed to choose
what is needed.
5.5 Types of Hadoop Installation
There are different ways of running Hadoop. Below are some of the different scenarios
for downloading, installing and running Hadoop.
Standalone mode: although Hadoop is a distributed framework for
managing big data, it can be installed on a single node in standalone mode (Ibrahim &
Bajwa, 2018). In this mode, the whole Hadoop framework runs as a single Java
process. This is regularly used for debugging; it is helpful particularly when testing
MapReduce programs on one node before executing them on a large Hadoop cluster.
Completely distributed mode: this is a distributed mode that contains
many commodity hardware nodes joined together to create the Hadoop cluster. In
such an arrangement, the JobTracker, NameNode and SecondaryNameNode operate
on the master node, while the DataNode and TaskTracker operate on the slave nodes.
Pseudo-distributed mode: here a single machine executes the whole
Hadoop cluster. The different daemons, such as the DataNode, JobTracker, NameNode
and TaskTracker, all execute on one Java system to simulate the distributed Hadoop
cluster.
5.6 Major Commands of Hadoop
Hadoop comprises different file system shell commands that work together to obtain
the needed results. They include:
-checksum
-moveFromLocal
-appendToFile
-copyToLocal
-chgrp
These are among the most popular commands applied in Hadoop to carry out different
workloads across the Hadoop platform; a usage sketch follows.
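For illustration, these commands are issued through the hadoop fs shell; the paths and group name below are hypothetical.

hadoop fs -moveFromLocal /tmp/log.txt /data/logs/
hadoop fs -appendToFile more.txt /data/logs/log.txt
hadoop fs -copyToLocal /data/logs/log.txt ./log.txt
hadoop fs -chgrp analysts /data/logs/log.txt
hadoop fs -checksum /data/logs/log.txt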
5.7 Hadoop Streaming
Hadoop Streaming is the basic API used when handling streaming data. Both the
mapper and the reducer receive their input in a standard format: input is read from stdin and
output is written to stdout. This is the technique used within Hadoop to handle a continuous
data stream and process it consistently (Gupta, Kumar & Gopal, 2015). Hadoop is the system
used for storing and processing big data, and Hadoop development involves the computation
of big data using different programming languages such as Scala and Java. Hadoop supports
many data types, such as Char, Decimal, Float, Boolean, Array, String and Double, and
thereby delivers an essential platform for data analytics. Some of the appealing details behind
the development of Big Data Hadoop include (a streaming example follows this list):
HDFS evolved from the Google File System
The MapReduce application was developed to analyse web pages
HBase evolved from Google's Big Table
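A minimal sketch of a Hadoop Streaming word-count pair in Python, reading from stdin and writing to stdout as described above. The file names are hypothetical; in a real run, Hadoop sorts the mapper output by key before it reaches the reducer, and the two scripts would be submitted together with the Hadoop Streaming JAR.

# mapper.py: emit "word<TAB>1" for every word arriving on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: sum the counts per word (input arrives sorted by key)
import sys
current, total = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(n)
if current is not None:
    print(f"{current}\t{total}")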
5.8 Reasons to Choose Apache Hadoop
With big data growing across the globe, the need for Hadoop developers is
increasing at a fast pace. Experienced and skilled Hadoop developers with practical
implementation expertise are highly needed to add value to existing processes. Besides,
there are other main reasons for choosing this technology. They include:
Extensive utilization of big data: many organizations are recognizing
that, in order to manage the explosion of data, they need to employ a technology that
will incorporate such data and extract something valuable and meaningful from it.
Undoubtedly, Hadoop has dealt with this issue, and organizations are beginning to
adopt the technology. Besides, a survey carried out by Tableau states that among
2,200 clients, around 76 percent of participants who have used Hadoop are hoping to
use it in new ways.
Security: recently, security has become a major issue for IT infrastructure. As
such, organizations are enthusiastically investing in security elements more than
anything else (Sirisha & V.D. Kiran, 2018). Apache Sentry, a component of the Hadoop
ecosystem, provides an authorization element for the information kept in a big data
cluster (S. & M., 2019).
New technologies are dominating: the big data trend is growing as
users request higher speed and turn away from traditional data warehouses.
Recognizing this concern, the Hadoop ecosystem is aggressively incorporating new
technologies, such as AtScale, Jethro, Cloudera Impala and Actian Vector, into its
primary infrastructure.
The eruption of big data has compelled organizations to employ technologies that can
assist them in managing unstructured and complex data in a manner that allows the maximum
amount of data to be analysed and extracted without delay or loss. This need led to the
development of big data tools capable of processing several tasks successfully at once. The
following are some of the features of Hadoop:
Capable of processing and storing large data sets: with increasing
amounts of data, data failure and loss become more likely (Manoj Kumar Danthala &
Siddhartha Ghosh, 2015). Nevertheless, Hadoop eases the situation, since it is
capable of processing and storing complex and huge unstructured data sets.
Excellent computational capabilities: its distributed computational
model ensures that big data is processed fast, with several nodes executing in parallel.
Fewer faults: employing it results in fewer faults. In case of failure in
one node, the workloads are automatically sent to other nodes.
No pre-processing needed: large amounts of data, both unstructured and
structured, can be stored and retrieved at the same time without requiring
pre-processing before they are placed in the database.
Highly scalable: the cluster size can be increased from one machine to
multiple servers without extensive administration (Suguna, 2016).
Cost-efficient: the framework is free of charge and thus requires very little
money to implement.
5.9 Practical Applications of Apache Hadoop
Some of the organizations which have employed Hadoop include:
Twitter
Yahoo
eBay
Cloudspace
Facebook
LinkedIn
AOL
Alibaba
Both the world and its professionals currently revolve around data analytics.
Thus, Hadoop will certainly serve as a support for candidates willing to build a career in
big data analytics. Furthermore, it is suitable for ETL developers, software specialists and
analytics professionals, among others. A comprehensive knowledge of DBMS,
Java and Linux is an added advantage in the analytics domain.
Large demand for competent specialists: based on an article published by Forbes in
2015, about 90 percent of companies across the globe are devoting resources to big data
analytics, and approximately one third of companies refer to it as very important. As such,
Big Data Hadoop is not just a technology but a powerful tool for organizations making their
appearance in the market. Thus, acquiring knowledge of Hadoop is crucial for beginners
hoping to be analysts in the ten years to come.
More market opportunities: market trends show an upward curve for big data
analytics and demonstrate that the need for data analysts and scientists will constantly
increase. This evidently points out that acquiring more knowledge of this technology will
support a successful career in any business.
Big bucks: statistics show that the salary of a Hadoop developer in the United States
is around $102,000. This evidently shows that learning Hadoop provides an opportunity to
grab the best-paying jobs in the world of data analytics.
Hadoop has taken the Big Data market by storm, as organizations consistently
benefit from its reliability and scalability. Although there are many players in the market,
Apache Hadoop has demonstrated constant advancement, making it a better choice for
organizations. With the growing number of firms moving towards big data analytics, learning
Hadoop and being conversant with its functionality will, without doubt, guide a candidate to
new career heights.
5.10 Apache Hadoop Scope
With analytical technologies flooding the current market, Hadoop has become famous
and is, without doubt, going to make an even more meaningful impact on firms. The following
evidence confirms this in a comprehensible way:
i. According to a survey carried out by MarketsandMarkets, the
reliability and efficiency of Hadoop have created a buzz among software leaders.
Based on the report, this market grew to $13.9 billion in 2017, 54.9 percent greater
than its 2012 size.
ii. Apache Hadoop is in its blossoming phase, and its growth is set to
improve in both the short and the long term for the following reasons:
Organizations require a distributed database with the ability to store
huge amounts of complex and unstructured data, as well as analyse and process the
information to identify important insights.
Organizations are willing to devote their resources to this sector;
however, it is necessary to invest in a technology that can be upgraded at lower cost
and is detailed in various ways.
iii. Marketanalysis.com predicted that the Hadoop market will be
strong in the following areas between 2017 and 2022:
It will have a robust impact in EMEA, the Americas and Asia Pacific.
It will include its own appliances and hardware, commercially
supported software, and integration, consulting and middleware support.
It will be used across a huge spectrum of areas such as ETL/data integration,
social media and clickstream analysis, the internet of things and mobile devices,
cybersecurity log analysis, predictive/advanced analytics, data mining/visualization,
data warehouse offload, and active archives.
6 Apache Spark
Apache Spark is a distributed, general-purpose data processing tool appropriate for
use in a variety of situations. Libraries for stream processing, graph computation, machine
learning and SQL sit on top of the Spark core and can be used together in one system.
Some of the programming languages that Apache Spark supports include R, Scala, Python and
Java (Hosseini & Kiani, 2018). Data scientists and application developers employ Apache
Spark in their applications so as to quickly transform, analyse and query large amounts
of data. Some of the activities most commonly associated with Spark include SQL batch and
ETL tasks across huge amounts of data, internet of things (IoT) workloads, machine learning
activities, financial systems, and the processing of data streams from sensors.
The predecessor of Apache Spark was MapReduce, which facilitated Google's
indexing of huge amounts of web content; it was a robust distributed processing
framework. AMPLab developed Apache Spark in 2009, and it started off as an incubated
project of the Apache Software Foundation (Ko & Won, 2016). Some of the advantages of Spark
include faster execution through caching data in memory across several operations running in
parallel. Secondly, Apache Spark has the ability to execute multi-threaded activities inside
Java virtual machine (JVM) tasks. Thirdly, it offers a broader functional programming model,
which is useful particularly for the parallel processing of distributed data with iterative
algorithms.
6.1 Apache Spark Ecosystem
The Apache Spark ecosystem is made up of several components, which this section
describes. They include SparkR, Spark SQL, Spark GraphX, Spark MLlib (machine learning)
and Spark Streaming. The ecosystem also contains extensible APIs in various
languages, such as R, Java, Python and Scala, which are built on top of the central Spark
execution engine (Sherif & Ravindra, 2018). Currently, Apache Spark is one of the most
popular tools for big data analysis; some experts describe it as a next-generation tool
that is being utilized by many institutions and organizations, has a large number of
contributors, and is quickly gaining acceptance as an execution engine for
big data (Singh, Anand & B., 2017). The figure below shows the Spark ecosystem.
Figure 13: Apache Spark Ecosystem (Source: Singh, Anand & B., 2017)
Apache Spark is an open-source and very powerful processing engine that is used as
a substitute for Apache Hadoop. It has increased developer productivity, is easy to
use, and is built for high-speed processing (Sherif & Ravindra, 2018). Additionally, it
supports graph computation, real-time processing and machine learning, and it
offers in-memory computing abilities for a variety of applications. As mentioned earlier,
it also supports APIs for different languages; the next section discusses the various
programming languages supported by Apache Spark in more detail.
Scala: Apache Spark is developed in Scala, and Scala exposes a number of
amazing features provided by Spark; most of these features are not supported by the other
programming languages.
Python: Python has been employed in Apache Spark to offer great libraries for
the analysis of data. Compared with Scala, Python is much slower.
R language: this language has been integrated into Spark to support statistical analysis
and machine learning, and it improves developer productivity. On its own, R
can process data on only one machine.
Java: this is a great language to use on Spark, especially for developers who have
been using Java with Hadoop.
6.2 Components of Apache Spark
The Apache Spark ecosystem has several components, a number of which are still
under development. Enhancements are made regularly to extend the application's
capabilities. The following are some of the components empowered by the Spark
ecosystem (Borisenko, Turdakov & Kuznetsov, 2014).
6.2.1 Apache Spark Core
The Spark core makes up the Apache Spark kernel and is the basis of distributed and
parallel processing (Chen & Sui, 2018). Spark core is responsible for all the important
input/output operations. Moreover, it handles job scheduling and monitoring over a cluster,
efficient memory management, networking, and fault recovery (Aleksiyants, Borisenko,
Turdakov, Sher & Kuznetsov, 2015). Spark core provides high-speed processing
because of its in-memory computation capacity. The Resilient Distributed Dataset (RDD) is
the special data structure used in Spark core. Without it, data would have to be stored in
intermediate stores in order to be reused and shared; RDDs were introduced in Spark core
to address the slow-down caused by such data sharing and reuse, by integrating fault-tolerant
in-memory computation. RDDs are immutable, so no changes can be made to them. However,
one RDD can be converted into another RDD, and this is achieved through a transformation
operation; in essence, existing RDDs are used to produce new RDDs (a short sketch follows).
Some of the primary qualities of Spark core are: it supports fault recovery, monitors the
cluster's role, is responsible for all primary input/output operations, improves productivity,
and is central to Spark programming.
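A minimal PySpark sketch of the RDD model described above: an immutable RDD is transformed into a new RDD rather than modified in place. The local master setting and the data are purely illustrative.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")     # run locally for illustration

numbers = sc.parallelize([1, 2, 3, 4])     # an immutable RDD
squares = numbers.map(lambda x: x * x)     # transformation: a new RDD; the old one is unchanged
print(squares.collect())                   # action: [1, 4, 9, 16]
sc.stop()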
6.2.2 Apache Spark SQL
This is one of the key components of the Spark ecosystem. Spark SQL is used where
data comes in large volumes to carry out the analysis of structured data (Kim
& Incheol, 2017). The component can expose more information about the data structures
and the computation being performed; such information is important for conducting extra
optimization inside the engine. Spark SQL does not require any special language to specify
the computations. In addition, it facilitates the execution of Hive statements on existing
Hadoop deployments. One major advantage of Spark SQL is that it simplifies the process of
merging and extracting different datasets. Both semi-structured and structured data can be
accessed through Spark SQL, and it can be used as a distributed SQL query engine. Some of
the primary features of Spark SQL include: full compatibility with Hive data, a standard way
to access numerous data sources, and support for the analysis of both semi-structured and
structured data (Kim & Incheol, 2017).
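A minimal PySpark illustration of querying structured data through Spark SQL; the toy data and the view name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL queries

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()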
6.2.3 Apache Spark Streaming
Spark Streaming is one of the lightweight Spark ecosystem components. It enables
developers to carry out data streaming and batch processing easily. Spark Streaming consumes
a continuous stream of input data for the near-real-time processing of data (Boachie & Li,
2019). One of the major benefits of Spark Streaming is its fast scheduling
capacity. It conducts streaming analytics by consuming data in mini-batches to which
transformations are applied. Some of the key features of Spark Streaming
include: the possibility of combining streaming data with historical data, exactly-once
message guarantees, fast, reliable and easy processing of live data streams, and integration
with Spark MLlib for machine learning. Spark Streaming is applied in situations where
rapid response and real-time analytics are required, for instance cyber security, diagnostics,
IoT sensors and alarms, among others. It is also essential in online marketing, supply chain
management, finance and campaigns. Spark Streaming operates in three stages, as illustrated
in the figure below:
Figure 14: How Apache Spark Streaming operates (Source: Boachie & Li, 2019)
The gathering phase involves identifying built-in stream sources, which fall into two categories: basic sources and advanced sources. The processing phase applies advanced algorithms with high-level functions to the gathered data. The storage phase moves the processed data to live dashboards, databases, and file systems. Spark Streaming supports a high level of abstraction; a minimal sketch of the three phases follows.
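This sketch illustrates the three phases using the classic DStream API; it assumes a text source such as a netcat server on localhost port 9999, which is purely a stand-in for a real stream source.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "stream-demo")     # at least two threads: receiver plus processing
ssc = StreamingContext(sc, 5)                    # gathering: group input into 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # basic built-in source (hypothetical server)

# Processing: word counts computed per micro-batch
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()                                  # storage/output: here simply printed to the console
ssc.start()
ssc.awaitTermination()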
6.2.4 Apache Spark MLlib
Spark MLlib is among the most essential Spark ecosystem components. It is a scalable machine learning library that provides high-speed, high-quality algorithms. Spark MLlib provides APIs for Python, Scala, and Java (Borisenko, Pastukhov & Kuznetsov, 2016), and it has emerged as an essential part of big data mining systems. It is compatible with different programming languages, scalable, easy to use, and integrates easily with other tools. MLlib has simplified the development and deployment of scalable machine learning pipelines. It supports the implementation of various machine learning algorithms, including classification, collaborative filtering, clustering, regression, and decomposition; a small classification sketch follows.
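The sketch below uses the DataFrame-based API that houses the current MLlib algorithms; the four labelled points are invented toy data, not part of the original text.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

# Hypothetical training set: a label plus a two-dimensional feature vector
train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.1, 1.2)),
    (1.0, Vectors.dense(2.2, 0.9)),
], ["label", "features"])

lr = LogisticRegression(maxIter=10)            # a classification algorithm from the library
model = lr.fit(train)                          # training runs in memory, distributed over the cluster
model.transform(train).select("label", "prediction").show()
spark.stop()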
6.2.5 Apache Spark GraphX
This component is built on top of Spark to allow users to build, transform, and reason about graph-structured data at scale. The component ships with a library of popular graph algorithms. GraphX is an API for graph and graph-parallel computation, and tasks such as classification and clustering can also be performed through it. The component allows graph computation and general data-parallel computation to be combined in one system. Apache Spark thus has a built-in graph computation engine for graph construction and manipulation.
6.2.6 Apache Spark R
The R language has been integrated into Apache Spark to improve developer productivity for statistical analysis. R is used together with Apache Spark via SparkR, which extends R's single-machine processing to Spark (Vychuzhanin, 2018). SparkR benefits users by letting them combine R with the power of Spark. Its advantages include reading data from different sources, such as JSON files and Hive tables. Additionally, it inherits all the optimizations made in the engine, such as memory management and code generation.
6.2.7 Scalability Function
Apache Spark scales across multiple machines and cores; it can run on huge amounts of data, typically terabytes, on clusters with many machines simultaneously (Funika & Koperek, 2016). Moreover, tasks that operate on data frames are dispersed across the cluster, as the sketch below illustrates.
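The sketch below shows this dispersion on a small scale: a synthetic data frame is split into eight partitions, and the aggregation is computed per partition before the partial results are merged. The numbers and partition count are arbitrary illustrations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("scale-demo").getOrCreate()

df = spark.range(0, 10_000_000)                # a large synthetic column of numbers
df = df.repartition(8)                         # spread the rows across eight partitions
print(df.rdd.getNumPartitions())               # 8: each partition can be processed in parallel
print(df.selectExpr("sum(id)").first()[0])     # summed per partition, then combined
spark.stop()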
6.3 Running Spark Applications on a Cluster
The diagram below illustrates how a spark application runs on a cluster.
Figure 15: Running Spark Applications on a Cluster
As shown in the figure above, Spark applications are executed as independent processes which are coordinated by the SparkSession object in the driver program. The cluster or resource manager assigns tasks to workers, one task per partition. Each task applies its unit of work to the dataset in its partition and produces a new partitioned dataset. Iterative algorithms, which process data repeatedly, benefit particularly from keeping datasets cached across iterations. Finally, the results are delivered back to the driver application or stored to a storage medium. Apache Spark supports various cluster or resource managers, including Kubernetes, Apache Hadoop YARN, Apache Mesos, and Spark's own standalone manager. Additionally, Spark has a local mode in which the driver and the executors run as threads on the user's computer rather than on a cluster. This is useful when developing an application on a personal workstation, as the sketch below shows.
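The sketch contrasts the two deployment styles; the master URLs are illustrative, and the cluster line assumes a configured YARN environment, so it is left commented out.

from pyspark.sql import SparkSession

# Local mode: driver and executors run as threads on one workstation (development)
spark_local = SparkSession.builder.master("local[*]").appName("dev").getOrCreate()
spark_local.stop()

# Cluster mode: the same application code, pointed at a resource manager instead.
# "yarn" assumes a configured Hadoop YARN cluster; a standalone cluster would use
# a URL such as spark://host:7077, and Mesos or Kubernetes have analogous URLs.
# spark_cluster = SparkSession.builder.master("yarn").appName("prod").getOrCreate()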
6.4 Applications of Apache Spark
Apache Spark has the ability to process multiple petabytes of data simultaneously, distributed across several integrated physical and virtual servers. Additionally, it has a wide array of supported languages (Scala, R, Python, and Java), APIs, and developer libraries. Spark's flexibility makes it a good option for a wide variety of applications and use cases. Apache Spark is most often applied with common NoSQL databases such as MongoDB, Apache Cassandra, Apache HBase, and MapR-DB; with distributed data stores such as Amazon S3, Hadoop HDFS, and MapR XD; and with distributed messaging stores such as Apache Kafka and MapR-ES. The next section discusses typical use cases for Apache Spark.
Stream processing: application developers constantly try to keep up with streams of data, such as log files and sensor data, which arrive simultaneously from multiple sources. Although it is practical to first store such data and assess it retrospectively, it is often important to process and analyse it as it arrives. For instance, data streams from financial activities can be processed as they arrive in order to detect and block transactions that indicate fraudulent activity.
Machine learning: the accuracy and feasibility of machine learning approaches continue to improve as the volume of available data increases. It is now possible to train a program to detect triggers and make decisions based on well-analysed and well-understood datasets, before applying the same solution to new, unknown data. Apache Spark is well suited to training machine learning algorithms because it can store data in memory and execute iterative queries quickly. The time needed to work through a dataset is reduced significantly by the ability to run the same queries repeatedly, at scale, to determine the most suitable algorithm.
Interactive analytics: rather than executing pre-defined queries to build static dashboards for stock prices, production, or sales, data scientists and analysts can use Apache Spark to explore data by running a query, viewing the outcome, and then changing or refining the initial query to generate further results. Spark suits such workloads because of its ability to adapt and respond quickly.
Data integration: Apache Spark is used in several organizations to extract, clean, and standardize data from different sources, reducing the time and cost required for extract, transform, and load (ETL) processes; a minimal sketch follows.
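The following ETL sketch reads, cleans, and writes a dataset; the file paths and column names are hypothetical placeholders, not references to any real deployment.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("etl-demo").getOrCreate()

# Extract: the path is a placeholder; Spark also reads Hive, JDBC, Kafka, and more
raw = spark.read.json("/data/raw/events.json")

# Transform: clean and standardize before loading
clean = (raw.dropna(subset=["user_id"])                    # drop incomplete records
            .withColumn("email", F.lower(F.col("email")))  # normalize casing
            .dropDuplicates(["user_id"]))                  # remove duplicate users

# Load: write the standardized dataset to a columnar store
clean.write.mode("overwrite").parquet("/data/curated/events")
spark.stop()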
6.5 Practical Applications of Apache Spark
Several vendors of big data technologies are quickly adopting Apache Spark because they recognize the opportunity it presents to add value, such as machine learning and interactive querying, to their existing big data products. Established brands such as Huawei and IBM have invested heavily in the technology. Moreover, several start-up companies are building business solutions that depend either partly or wholly on Apache Spark. For instance, in 2013 the Berkeley team behind Spark founded Databricks, which offers a hosted, Spark-driven, end-to-end data platform. The company received more than $245 million in funding between 2013 and 2017, enabling its employees to continue improving and extending the open-source code of the Apache Spark project (Wang, Zhao, Liu & Lv, 2018).
Furthermore, major Hadoop vendors such as Hortonworks, Cloudera, and MapR have started to support YARN-based Spark alongside their existing products, with every vendor working to add value for its clients. Moreover, Huawei, IBM, and other companies have funded Apache Spark projects, enabling them to integrate Spark with their own products while supporting extensions and improvements to the Apache project (Kienzler et al., 2018). Companies that run Apache Spark systems include Tencent (a social networking company), Taobao (an eCommerce company), and Baidu (a Chinese search engine). It is reported that Tencent, with more than 800 million active users, generates more than 700 terabytes of data every day, processed on a cluster of over 8,000 compute nodes (Guo, Zhao, Zou, Fang & Peng, 2018). Moreover, Novartis, a pharmaceutical company, runs systems that depend on Apache Spark to minimize the time needed to deliver data to researchers while ensuring that contractual safeguards and ethical principles are upheld.
6.6 Reasons to Choose Spark
There are several reasons why businesses and organizations should choose Spark. These reasons can be summarized in three points: simplicity, speed, and support.
Simplicity: the capabilities of Apache Spark are accessed through a set of rich APIs which have been specifically designed for easy and quick interaction with data at scale (Mavridis & Karatza, 2017). The APIs are well structured and documented in a manner that makes it simple for application developers and data scientists to put Spark to work swiftly.
Speed: Apache Spark operates both on disk and in memory and has been designed for speed. Spark performs particularly well when handling interactive queries on data stored in memory. As mentioned earlier, Spark has been reported to be up to 100 times faster than Hadoop MapReduce.
Support: Apache Spark supports a wide variety of programming languages, including Scala, R, Python, and Java. It also integrates closely with common storage solutions such as Apache Cassandra, Apache HBase, Apache Hadoop HDFS, and MapR (event store, database, and file system) (Mehrotra & Grade, 2019).
Moreover, the Apache Spark community is large, global, and active, and the platform is well supported by commercial providers such as Huawei, IBM, and Databricks.
To conclude this section, a notable property of Apache Spark is that it amplifies existing tools rather than creating new solutions. The components of the Spark ecosystem have made it a popular big data tool compared to other frameworks, because Spark can process various data types, for instance graph processing, structured data processing, and real-time analytics, among others (Poojitha & Sowmyarani, 2018). As a result, Apache Spark has continued to attract attention and is expected to extend its functionality for processing ad-hoc queries. Apache Spark also replaces MapReduce by offering iterative processing logic and interactive code execution through the Scala REPL and Python, while code can still be compiled using R and Java. Additionally, this section has identified the main components of the Spark ecosystem: SparkR, Spark SQL, Spark GraphX, Spark MLlib, and Spark Streaming.
The ecosystem also contains extensible APIs written in various languages, such as R, Java, Python, and Scala, built on top of the central Spark execution engine. Apache Spark is an open-source, powerful processing engine used as a substitute for Apache Hadoop. It increases developer productivity, is easy to use, and is built for high-speed processing. It supports graph computation, real-time processing, and machine learning, and it offers in-memory computing for a variety of applications. This section has also discussed how Apache Spark scales across multiple machines and cores and runs on huge amounts of data, typically terabytes, on clusters with many machines simultaneously (Karau, Konwinski & Wendell, 2015).
Apache Spark can process multiple petabytes of data simultaneously, distributed across several integrated physical and virtual servers. In its final parts, this section has described how several big data vendors are quickly adopting Apache Spark because they recognize the opportunity it presents to add value, such as machine learning and interactive querying, to their existing big data products; established brands such as Huawei and IBM have invested heavily in the technology. The section has also elaborated on the reasons why a business or company should choose Apache Spark.
7 Comparison between Apache Hadoop and Apache Spark
There are many big data analytical tools in the current market, so it is challenging to choose the right one. A standard comparison of the advantages and disadvantages of each is likely to be ineffective, since organizations should examine each platform from the angle of their specific needs. This section compares the two leading platforms, Apache Spark and Apache Hadoop.
7.1 The Market Situation
Both Apache Spark and Apache Hadoop are open-source frameworks created by the Apache Software Foundation, and both are leading platforms in big data analytics. Hadoop has dominated the big data market for over five years. A recent market survey shows that Hadoop has over 50,000 clients while Spark has more than 10,000 customers. Nevertheless, Spark gained popularity in 2012 and overtook Hadoop within a year, and installation growth in 2016 and 2017 shows the trend is still consistent: Spark and Hadoop installations grew at 47 percent and 14 percent respectively.
7.2 The Main Difference Between Apache Hadoop and Apache Spark
The main difference between the two platforms lies in the processing method. Hadoop processes data by reading from and writing to disk, while Spark does it in memory (Li & Yang, 2016). Therefore, the processing speeds vary considerably: Hadoop can be up to 100 times slower than Spark. The amount of data each handles well also differs, with Hadoop supporting larger data sets than Spark.
The following are some of the tasks that Hadoop is good for:
Linear processing of large sets of data: Apache Hadoop permits linear processing of large data sets. It divides a huge data set into smaller portions to be handled individually on various data nodes, then automatically collects the outcomes from the nodes to return a single result. In cases where the resulting data set is bigger than the available RAM, Hadoop may perform better than Spark.
Economical solution where no instant outcomes are expected: Hadoop is an excellent solution when processing speed is not critical. For example, if data processing can be carried out overnight, then the most suitable tool would be Apache Hadoop.
The following are some of the tasks Apache Spark is good for:
Fast data processing: Spark carries out its processing in memory, making it faster than Hadoop; it is reported to be 10 times faster in data storage and 100 times faster in data processing.
Iterative processing: if the job processes the same data repeatedly, Spark outperforms Hadoop. Spark's RDDs (Resilient Distributed Datasets) allow several map operations in memory, while Hadoop must write temporary outcomes to disk (see the sketch after this list).
Close to real-time processing: if an organization needs instant insights, the most suitable tool is Spark with its in-memory processing (Prasad, 2018).
Graph processing: Spark is excellent for the iterative, predictable computations typical of graph processing, and it offers a graph computation API called GraphX.
Machine learning: Spark has an inbuilt machine learning library, MLlib, while Hadoop must rely on a third party for one. MLlib provides algorithms that run in memory and work with little or no particular configuration, installation, or modification; if necessary, a Spark expert can tune and modify them to fit organizational needs.
Joining sets of data: because of its speed, Spark can generate all combinations faster. Nevertheless, Hadoop may perform better where very large data sets require a lot of sorting and shuffling.
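The sketch below illustrates the iterative-processing point from the list above: a dataset is pinned in memory with persist() and then scanned several times without re-reading the input. The data and thresholds are arbitrary toy values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("iter-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1, 1001)).persist()  # keep the dataset cached in memory

# Each pass reuses the cached partitions instead of re-reading input, which is
# where Spark's advantage over disk-bound MapReduce shows for iterative jobs
total = 0
for threshold in (100, 500, 900):                # hypothetical repeated passes
    total += data.filter(lambda x: x > threshold).count()
print(total)                                     # 900 + 500 + 100 = 1500
spark.stop()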
7.3 Examples of Practical Applications
Different examples of practical applications are analysed to evaluate which
framework outshines the other. The following are some of the examples:
Client segmentation: evaluating client behaviour and pinpointing customer segments that show similar behaviour patterns helps companies understand client preferences and develop a distinctive client experience (Zheng, 2015).
Risk management: predicting various potential scenarios helps managers make appropriate decisions by choosing less risky alternatives.
Real-time fraud detection: after the system is trained on historical data with machine-learning algorithms, it can use what it has learned to detect anomalies in real time that may indicate a potential fraud.
Industrial big data analysis: this is likewise about predicting and detecting anomalies, but here the anomalies are connected to machinery failures. A properly configured system gathers information from sensors to identify pre-failure conditions.
In all of the above examples, Spark is considered to outperform Hadoop because of its fast, real-time processing.
8 Conclusion
In conclusion, based on the evidence this paper has presented, Apache Spark is better than Apache Hadoop in several respects, and the paper therefore proposes that Apache Spark be used. In terms of performance, Apache Spark has higher processing speeds and near-real-time analytics, whereas Apache Hadoop was designed for batch processing; it does not support real-time processing and is much slower than Apache Spark. Secondly, Apache Spark is popular for its ease of use because it comes with user-friendly APIs built for Spark SQL, Python, Java, and Scala. Apache Spark also has an interactive mode that gives users and application developers immediate feedback on queries and actions, whereas Apache Hadoop has no interactive elements and relies on add-ons such as Pig and Hive. Apache Spark is also compatible with Apache Hadoop and can share all the data sources that Hadoop uses; even so, because of its better performance, Apache Spark remains the preferred option.
In terms of data processing, Apache Spark has better processing speed and power than Apache Hadoop because it performs operations in memory in a single step. Apache Spark supports shared-secret (password) authentication, which is easier to manage than the Kerberos authentication used by Apache Hadoop. Apache Spark is a general-purpose distributed data processing tool appropriate for a variety of situations: stream processing, graph computation, machine learning, and SQL libraries sit on top of the Spark core and can be used together in one system. The Spark ecosystem has several components, a number of which are still under development, and enhancements are made regularly to extend the platform's capabilities. Among these components, Spark Streaming is a lightweight one that enables developers to carry out data streaming and batch processing easily; it consumes a continuous stream of input data for real-time processing, and one of its major benefits is its fast scheduling capability.
9 References
Agarwal, D. (2018). MAPREDUCE: INSIGHT ANALYSIS OF BIG DATA VIA
PARALLEL DATA PROCESSING USING JAVA PROGRAMMING, HIVE AND
APACHE PIG. International Journal Of Advanced Research In Computer Science, 9(1),
536-540. doi: 10.26483/ijarcs.v9i1.5414
Al Jabri, H., Al-Badi, A., & Ali, O. (2017). Exploring the Usage of Big Data Analytical
Tools in Telecommunication Industry in Oman. Information Resources Management
Journal, 30(1), 1-14. doi: 10.4018/irmj.2017010101
Aleksiyants, A., Borisenko, O., Turdakov, D., Sher, A., & Kuznetsov, S. (2015).
Implementing Apache Spark jobs execution and Apache Spark cluster creation for
Openstack Sahara. Proceedings Of The Institute For System Programming Of
RAS, 27(5), 35-48. doi: 10.15514/ispras-2015-27(5)-3
Aung, O., & Thein, T. (2014). Enhancing NameNode Fault Tolerance in Hadoop Distributed
File System. International Journal Of Computer Applications, 87(12), 41-47. doi:
10.5120/15264-4020
Barskar, A., & Phulre, A. (2017). Opinion Mining of Twitter Data using Hadoop and Apache
Pig. International Journal Of Computer Applications, 158(9), 1-6. doi:
10.5120/ijca2017912854
Batarseh, F., Yang, R., & Deng, L. (2017). A comprehensive model for management and
validation of federal big data analytical systems. Big Data Analytics, 2(1). doi:
10.1186/s41044-016-0017-x
Bettencourt, L. (2014). The Uses of Big Data in Cities. Big Data, 2(1), 12-22. doi:
10.1089/big.2013.0042
Boachie, E., & Li, C. (2019). Big data processing with Apache Spark in university
institutions: spark streaming and machine learning algorithm. International Journal Of
Continuing Engineering Education And Life-Long Learning, 29(1/2), 5. doi:
10.1504/ijceell.2019.099217
Borisenko, O., Pastukhov, R., & Kuznetsov, S. (2016). Deploying Apache Spark virtual
clusters in cloud environments using orchestration technologies. Proceedings Of The
Institute For System Programming Of The RAS, 28(6), 111-120. doi: 10.15514/ispras-
2016-28(6)-8
Borisenko, O., Turdakov, D., & Kuznetsov, S. (2014). Automating cluster creation and
management for Apache Spark. Proceedings Of The Institute For System Programming
Of RAS, 26(4), 33-44. doi: 10.15514/ispras-2014-26(4)-3
Bornakke, T., & Due, B. (2018). Big–Thick Blending: A method for mixing analytical
insights from big and thick data sources. Big Data & Society, 5(1), 205395171876502.
doi: 10.1177/2053951718765026
Bughin, J. (2016). Big data, Big bang?. Journal Of Big Data, 3(1). doi: 10.1186/s40537-015-
0014-3
Cao, N., Wu, Z., Liu, H., & Zhang, Q. (2010). Improving downloading performance in hadoop distributed file system. Journal Of Computer Applications, 30(8), 2060-2065. doi: 10.3724/sp.j.1087.2010.02060
Cassales, G., Charão, A., Pinheiro, M., Souveyet, C., & Steffenel, L. (2015). Context-aware
Scheduling for Apache Hadoop over Pervasive Environments. Procedia Computer
Science, 52, 202-209. doi: 10.1016/j.procs.2015.05.058
Catlett, C., & Ghani, R. (2015). Big Data for Social Good. Big Data, 3(1), 1-2. doi:
10.1089/big.2015.1530
Cercone, N. (2015). What's the big deal about big data?. Big Data And Information Analytics, 1(1), 31-79. doi: 10.3934/bdia.2016.1.31
Chen, C., & Jiang, S. (2015). Research of the Big Data Platform and the Traditional Data
Acquisition and Transmission based on Sqoop Technology. The Open Automation And
Control Systems Journal, 7(1), 1174-1180. doi: 10.2174/1874444301507011174
Chen, F., Xu, C., & Zhu, Q. (2014). A Design of a Sci-Tech Information Retrieval Platform
Based on Apache Solr and Web Mining. Applied Mechanics And Materials, 530-531,
883-886. doi: 10.4028/www.scientific.net/amm.530-531.883
Chen, L., Ko, J., & Yeo, J. (2015). Analysis of the Influence Factors of Data Loading
Performance Using Apache Sqoop. KIPS Transactions On Software And Data
Engineering, 4(2), 77-82. doi: 10.3745/ktsde.2015.4.2.77
Chen, M., & Sui, H. (2018). Parallel Entity Resolution with Apache Spark. Destech Transactions On Engineering And Technology Research, (ecame). doi: 10.12783/dtetr/ecame2017/18462
Dhar, V. (2014). Why Big Data = Big Deal. Big Data, 2(2), 55-56. doi:
10.1089/big.2014.1522
Diesner, J. (2015). Small decisions with big impact on data analytics. Big Data &
Society, 2(2), 205395171561718. doi: 10.1177/2053951715617185
Dumbill, E. (2013). Making Sense of Big Data. Big Data, 1(1), 1-2. doi:
10.1089/big.2012.1503
Lydia, E. L., & Srinivasa Rao, M. (2018). Applying compression algorithms on hadoop cluster implementing through apache tez and hadoop mapreduce. International Journal Of Engineering & Technology, 7(2.26), 80. doi: 10.14419/ijet.v7i2.26.12539
Fan, H., Ramaraju, A., McKenzie, M., Golab, W., & Wong, B. (2015). Understanding the
causes of consistency anomalies in Apache Cassandra. Proceedings Of The VLDB
Endowment, 8(7), 810-813. doi: 10.14778/2752939.2752949
Funika, W., & Koperek, P. (2016). SCALING EVOLUTIONARY PROGRAMMING WITH
THE USE OF APACHE SPARK. Computer Science, 17(1), 69. doi:
10.7494/csci.2016.17.1.69
Glybovets, A., & Dmytruk, Y. (2016). The Effectiveness of Programming Languages in the Apache Hadoop MapReduce Framework. Upravlâûŝie Sistemy I Mašiny, 5(265), 84-92. doi: 10.15407/usim.2016.05.084
Greeshma, L., & Pradeepini, G. (2016). Big Data Analytics with Apache Hadoop MapReduce
Framework. Indian Journal Of Science And Technology, 9(26). doi:
10.17485/ijst/2016/v9i26/93418
Guo, R., Zhao, Y., Zou, Q., Fang, X., & Peng, S. (2018). Bioinformatics applications on
Apache Spark. Gigascience. doi: 10.1093/gigascience/giy098
Gupta, P., Kumar, P., & Gopal, G. (2015). Sentiment Analysis on Hadoop with Hadoop
Streaming. International Journal Of Computer Applications, 121(11), 4-8. doi:
10.5120/21582-4651
Hare, J. (2014). Bring it on, Big Data: Beyond the Hype. Big Data, 2(2), 73-75. doi:
10.1089/big.2014.1520
Hartung, T. (2018). Making Big Sense From Big Data. Frontiers In Big Data, 1. doi:
10.3389/fdata.2018.00005
Hausenblas, M., & Nadeau, J. (2013). Apache Drill: Interactive Ad-Hoc Analysis at
Scale. Big Data, 1(2), 100-104. doi: 10.1089/big.2013.0011
Hoskins, M. (2014). Big Data 2.0: Cataclysm or Catalyst?. Big Data, 2(1), 5-6. doi:
10.1089/big.2014.1519
Hoskins, M. (2014). Common Big Data Challenges and How to Overcome Them. Big
Data, 2(3), 142-143. doi: 10.1089/big.2014.0030
Hosseini, B., & Kiani, K. (2018). A Robust Distributed Big Data Clustering-based on
Adaptive Density Partitioning using Apache Spark. Symmetry, 10(8), 342. doi:
10.3390/sym10080342
Huang, W., Meng, L., Zhang, D., & Zhang, W. (2017). In-Memory Parallel Processing of
Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN
Model. IEEE Journal Of Selected Topics In Applied Earth Observations And Remote
Sensing, 10(1), 3-19. doi: 10.1109/jstars.2016.2547020
Hussain, A., & Roy, A. (2016). The emerging era of Big Data Analytics. Big Data
Analytics, 1(1). doi: 10.1186/s41044-016-0004-2
Hussain, G., & T, T. (2016). File Systems and Hadoop Distributed File System in Big
Data. IJARCCE, 5(12), 36-40. doi: 10.17148/ijarcce.2016.51207
Ibrahim, M., & Bajwa, I. (2018). Design and Application of a Multi-Variant Expert System
Using Apache Hadoop Framework. Sustainability, 10(11), 4280. doi:
10.3390/su10114280
Karau, H., Konwinski, A., & Wendell, P. (2015). Learning Spark. O'Reilly Media.
Kienzler, R., Karim, R., Alla, S., Amirghodsi, S., Rajendran, M., Hall, B., & Mei, S.
(2018). Apache Spark 2. Birmingham: Packt Publishing Ltd.
Kim, J., & Incheol, K. (2017). SSQUSAR : A Large-Scale Qualitative Spatial Reasoner
Using Apache Spark SQL. KIPS Transactions On Software And Data Engineering, 6(2),
103-116. doi: 10.3745/ktsde.2017.6.2.103
Kim, S., & Lee, I. (2014). Block Access Token Renewal Scheme Based on Secret Sharing in
Apache Hadoop. Entropy, 16(8), 4185-4198. doi: 10.3390/e16084185
Kirkpatrick, R. (2013). Big Data for Development. Big Data, 1(1), 3-4. doi:
10.1089/big.2012.1502
Ko, S., & Won, J. (2016). Processing large-scale data with Apache Spark. Korean Journal Of
Applied Statistics, 29(6), 1077-1094. doi: 10.5351/kjas.2016.29.6.1077
Landset, S., Khoshgoftaar, T., Richter, A., & Hasanin, T. (2015). A survey of open source
tools for machine learning with big data in the Hadoop ecosystem. Journal Of Big
Data, 2(1). doi: 10.1186/s40537-015-0032-1
Lathar, P., & Srinivasa, K. (2019). A Study on the Performance and Scalability of Apache
Flink Over Hadoop MapReduce. International Journal Of Fog Computing, 2(1), 61-73.
doi: 10.4018/ijfc.2019010103
Lehrack, S., Duckeck, G., & Ebke, J. (2014). Evaluation of Apache Hadoop for parallel data
analysis with ROOT. Journal Of Physics: Conference Series, 513(3), 032054. doi:
10.1088/1742-6596/513/3/032054
Li, Y., & Yang, S. (2016). Integrating Apache Spark and External Data Sources Using Hadoop Interfaces. Destech Transactions On Engineering And Technology Research, (ssme-ist). doi: 10.12783/dtetr/ssme-ist2016/3990
Mach-Król, M., & Modrzejewska, D. (2017). ANALYTICAL NEEDS OF POLISH
COMPANIES VS. BIG DATA. Informatyka Ekonomiczna, 2(44), 82-93. doi:
10.15611/ie.2017.2.07
Danthala, M. K., & Ghosh, S. (2015). Bigdata Analysis: Streaming Twitter Data with Apache Hadoop and Visualizing using BigInsights. International Journal Of Engineering Research And Technology, 4(05). doi: 10.17577/ijertv4is050643
Matacuta, A., & Popa, C. (2018). Big Data Analytics: Analysis of Features and Performance of Big Data Ingestion Tools. Informatica Economica, 22(2/2018), 25-34. doi: 10.12948/issn14531305/22.2.2018.03
Mavani, M. (2013). Comparative Analysis of Andrew Files System and Hadoop Distributed
File System. Lecture Notes On Software Engineering, 122-125. doi:
10.7763/lnse.2013.v1.27
Mavridis, I., & Karatza, H. (2017). Performance evaluation of cloud-based log file analysis
with Apache Hadoop and Apache Spark. Journal Of Systems And Software, 125, 133-
151. doi: 10.1016/j.jss.2016.11.037
Mehrotra, S., & Grade, A. (2019). Apache Spark Quick Start Guide. Birmingham: Packt
Publishing Ltd.
Naidu, D. (2018). BIG DATA “WHAT-HOW-WHY” AND ANALYTICAL TOOLS FOR
HYDROINFORMATICS. International Journal Of Advanced Multidisciplinary
Scientific Research, 1(4), 37-47. doi: 10.31426/ijamsr.2018.1.4.214
Okorafor, E. (2012). Availability Of JobTracker Machine In Hadoop/MapReduce Zookeeper
Coordinated Clusters. Advanced Computing: An International Journal, 3(3), 19-30. doi:
10.5121/acij.2012.3302
Plase, D., Niedrite, L., & Taranovs, R. (2017). A Comparison of HDFS Compact Data
Formats: Avro Versus Parquet. Mokslas - Lietuvos Ateitis, 9(3), 267-276. doi:
10.3846/mla.2017.1033
Poojitha, G., & Sowmyarani, C. (2018). Pipeline for Real-time Anomaly Detection in Log
Data Streams using Apache Kafka and Apache Spark. International Journal Of
Computer Applications, 182(24), 8-13. doi: 10.5120/ijca2018917942
Prasad, K. (2018). Real-time Data Streaming using Apache Spark on Fully Configured
Hadoop Cluster. JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL
SCIENCES, 13(5). doi: 10.26782/jmcms.2018.12.00013
Ranawade, S., Navale, S., Dhamal, A., Deshpande, K., & Ghuge, C. (2017). Online
Analytical Processing on Hadoop using Apache Kylin. International Journal Of Applied
Information Systems, 12(2), 1-5. doi: 10.5120/ijais2017451682
Swarna, C. S., & Ansari, Z. (2017). Apache Pig - A Data Flow Framework Based on Hadoop Map Reduce. International Journal Of Engineering Trends And Technology, 50(5), 271-275. doi: 10.14445/22315381/ijett-v50p244
S., R., & M., M. (2019). Approval of Data in Hadoop Using Apache Sentry. International
Journal Of Computer Sciences And Engineering, 7(1), 583-586. doi:
10.26438/ijcse/v7i1.583586
Saeed, F. (2018). Towards quantifying psychiatric diagnosis using machine learning
algorithms and big fMRI data. Big Data Analytics, 3(1). doi: 10.1186/s41044-018-0033-
0
Sherif, A., & Ravindra, A. (2018). Apache Spark deep learning cookbook. Birmingham:
Packt Publishing Ltd.
Singh, D., & Reddy, C. (2014). A survey on platforms for big data analytics. Journal Of Big
Data, 2(1). doi: 10.1186/s40537-014-0008-6
Singh, P., Anand, S., & B., S. (2017). Big Data Analysis with Apache Spark. International
Journal Of Computer Applications, 175(5), 6-8. doi: 10.5120/ijca2017915251
Sirisha, N., & V.D. Kiran, K. (2018). Authorization of Data In Hadoop Using Apache
Sentry. International Journal Of Engineering & Technology, 7(3.6), 234. doi:
10.14419/ijet.v7i3.6.14978
Suguna, S. (2016). Improvement of HADOOP Ecosystem and Their Pros and Cons in Big
Data. International Journal Of Engineering And Computer Science. doi:
10.18535/ijecs/v5i5.57
Tromp, E., Pechenizkiy, M., & Gaber, M. (2017). Expressive modeling for trusted big data
analytics: techniques and applications in sentiment analysis. Big Data Analytics, 2(1).
doi: 10.1186/s41044-016-0018-9
Vis, F. (2013). A critical reflection on Big Data: Considering APIs, researchers and tools as
data makers. First Monday, 18(10). doi: 10.5210/fm.v18i10.4878
Vychuzhanin, V. (2018). Distributed software complex on the basic former apache spark for
processing the flow big data from complex technical systems. INFORMATICS AND
MATHEMATICAL METHODS IN SIMULATION, 8(2), 146-155. doi:
10.15276/imms.v8.no2.146
Wang, Z., Zhao, Y., Liu, Y., & Lv, C. (2018). A speculative parallel simulated annealing
algorithm based on Apache Spark. Concurrency And Computation: Practice And
Experience, 30(14), e4429. doi: 10.1002/cpe.4429
Xu, J., & Liang, J. (2013). Research on a Distributed Storage Application with
HBase. Advanced Materials Research, 631-632, 1265-1269. doi:
10.4028/www.scientific.net/amr.631-632.1265
Yan, L., Liu, S., & Lao, D. (2014). Solr Index Optimization Based on MapReduce. Applied
Mechanics And Materials, 556-562, 3506-3509. doi:
10.4028/www.scientific.net/amm.556-562.3506
Zeide, E. (2017). The Structural Consequences of Big Data-Driven Education. Big
Data, 5(2), 164-172. doi: 10.1089/big.2016.0061
Zheng, Z. (2015). Introduction to Big Data Analytics and the Special Issue on Big Data
Methods and Applications. Journal Of Management Analytics, 2(4), 281-284. doi:
10.1080/23270012.2015.1116414