APACHE HADOOP VERSUS APACHE SPARK

Abstract

Big data has acquired huge attention in the past few years. Evaluating big data is a basic requirement in the modern era, and such requirements are daunting when the data sets involved are massive. It is very challenging to evaluate huge amounts of data to extract patterns and relevance in a timely manner. This paper investigates the concept of Big Data Analysis and discusses two big data analytical tools: Apache Spark and Apache Hadoop. This paper proposes that Apache Spark should be used. In terms of performance, Apache Spark has higher processing speeds and near real-time analytics. Apache Hadoop, on the other hand, was designed for batch processing; it does not support real-time processing and is much slower than Apache Spark. Secondly, Apache Spark is popular for its ease of use because it comes with user-friendly APIs built for Spark SQL, Python, Java, and Scala. Apache Spark also provides an interactive mode that gives users and application developers immediate responses to queries and actions. Apache Hadoop, by contrast, has no interactive elements and only supports add-ons such as Pig and Hive. Apache Spark is also compatible with Apache Hadoop and shares all the data sources that Hadoop uses. However, because of its better performance, Apache Spark is still the preferred option.
Table of Contents

1 Introduction
2 Big Data
3 Big Data Analytics
4 Big Data Analytic Tools
5 Apache Hadoop
5.1 Evolvement of Apache Hadoop
5.2 Hadoop Ecosystem/Architecture
5.3 Components of Apache Hadoop
5.3.1 HDFS (Hadoop Distributed File System)
5.3.2 Hadoop MapReduce
5.3.3 Hadoop Common
5.3.4 Hadoop YARN
5.3.5 Other Hadoop Components
5.4 Hadoop Download
5.5 Types of Hadoop Installation
5.6 Major Commands of Hadoop
5.7 Hadoop Streaming
5.8 Reasons to Choose Apache Hadoop
5.9 Practical Applications of Apache Hadoop
5.10 Apache Hadoop Scope
6 Apache Spark
6.1 Apache Spark Ecosystem
6.2 Components of Apache Spark
6.2.1 Apache Spark Core
6.2.2 Apache Spark SQL
6.2.3 Apache Spark Streaming
6.2.4 Apache Spark MLlib
6.2.5 Apache Spark GraphX
6.2.6 Apache Spark R
6.2.7 Scalability Function
6.3 Running Spark Applications on a Cluster
6.4 Applications of Apache Spark
6.5 Practical Applications of Apache Spark
6.6 Reasons to Choose Spark
7 Comparison between Apache Hadoop and Apache Spark
7.1 The Market Situation
7.2 The Main Difference between Apache Hadoop and Apache Spark
7.3 Examples of Practical Applications
8 Conclusion
9 References
1 Introduction

In the current computer age, people increasingly depend on technological devices, and almost all aspects of human life, including the social, personal and professional, are covered by technology. Almost all of those aspects deal with some kind of data (Hartung, 2018). As a result of the huge increase in data complexity caused by the rapid growth in variety and speed, new challenges have emerged in the data management sector, hence the emergence of the term Big Data. Analysing, storing, assessing and securing big data are among the popular topics in the current technological world (Hussain & Roy, 2016).

Big data analysis is a method of collecting data from various sources, arranging that information in a meaningful way and then evaluating those big data sets to uncover important figures and facts from the collection. This analysis assists in identifying hidden figures and facts in the data, as well as ranking or categorizing the information based on the importance it offers (Hoskins, 2014). In summary, big data analysis is the process of acquiring knowledge from a massive variety of data. Organizations such as Twitter process about 10 thousand tweets per second before broadcasting them to people. They evaluate all this data at a very fast rate to make sure each tweet complies with the set policy and prohibited words are removed from the tweets. The evaluation process must be carried out in real time to ensure that there are no delays in broadcasting tweets live to the public (Kirkpatrick, 2013). Similarly, enterprises such as forex trading firms evaluate social information to forecast future public trends.

To evaluate such large data, it is necessary to use analytical tools. This paper concentrates on Apache Hadoop and Apache Spark. The sections of this paper include a literature review that explores the general view of big data and big data analytics. The paper will also discuss the two leading big data analytical tools: Apache Spark and Apache Hadoop.
2 Big Data

The availability and exponential growth of massive amounts of information of different varieties is referred to as Big Data (Hoskins, 2014). Big Data is a term that is popularly used in the current automated world and is perceived to be as important to society and business as the internet. It is extensively proved and believed that more data results in more precise assessments, which in turn lead to more timely, legitimate and confident decision making (Bettencourt, 2014). Better decisions and judgements result in reduced risk, higher operational efficiency, and cost reductions. Researchers of Big Data picture big data as follows:

Volume-wise: this is a significant factor that has led to the emergence of big data. Volume is increasing due to different factors. Governments and companies have been documenting transactional data for years. Social media, automation, machine-to-machine communication and sensors are consistently generating unstructured data, among other sources (Saeed, 2018). Previously, storage of data was a problem; however, the emergence of affordable and advanced storage devices has helped address the issue of data storage (Bughin, 2016). Nevertheless, volume still causes other problems, such as identifying the significance within huge data volumes and gathering important information by analysing the data.

Velocity-wise: the rate at which data volume is increasing is becoming critical, and it is challenging to address the issue efficiently and in time. The need to manage large pieces of information in real time is driven by the rise of RFID (radio-frequency identification) tags, robotics, sensors and automation, internet streaming, and other technology facilities (Catlett & Ghani, 2015). As such, the increase in data velocity is among the biggest challenges experienced by every big company today.
Variety-wise: although the increase in data volume is a huge challenge, data variety is a bigger problem. Information is growing in different varieties, including different formats, unstructured data, various file systems, images, financial data, scientific data, structured, relational and non-relational data, videos, multimedia, aviation data, etc. (Dhar, 2014). The issue is finding ways to correlate the various types of data in time to obtain value from them. Currently, many companies are trying hard to find better solutions to the issue.

Variability-wise: the inconsistent trend of the flow of big data is a big challenge. Social media reaction to events across the globe drives large volumes of information, which require timely assessment before the trend changes (Diesner, 2015). Events across the world have an impact on financial markets, and operating costs increase further when handling unstructured data.

Complexity-wise: huge volumes of data, inconsistent trends, and increasing variety make big data very challenging. In spite of all the above facts, big data must be sorted out to correlate, connect and develop useful relational linkages and hierarchies in time before the information becomes difficult to control (Dumbill, 2013). This illustrates the complexity involved in today's big data.

In short, any data repository with the following features can be referred to as big data:

Central planning and management
Extensible: primary capabilities can be altered and augmented
Manages huge amounts of data (Zeide, 2017)
Less costly
Offers capabilities for processing data
Accessibility: a highly available open-source or commercial product with excellent usability (Hare, 2014)
Distributed redundant data storage
Very fast data insertion
Hardware agnostic
Parallel processing of tasks

3 Big Data Analytics

Big data analytics is the practice of employing assessment algorithms running on powerful supporting platforms to reveal the potential hidden in big data, such as unknown patterns or hidden correlations (Tromp, Pechenizkiy & Gaber, 2017). Based on the time required to process big data, big data analytics can be grouped into two different paradigms.

Batch processing: here, information is first stored and then assessed. The leading model for batch processing is MapReduce. The basic concept of MapReduce is that information is first split into small portions (Cercone, F'IEEE, 2015). These portions are later processed in a distributed and parallel way to create intermediate outcomes. The end result is acquired by combining all the intermediate outcomes (Al Jabri, Al-Badi & Ali, 2017). MapReduce organizes computation resources near the location of the data, which avoids the communication cost of transmitting data. The model is easy to use and is extensively used in web mining, bioinformatics and machine learning.

Streaming processing: the first assumption here is that the value of data relies on its freshness. Therefore, the streaming processing model evaluates information in a timely manner to obtain its outcome. In this model, information is acquired in a stream. During its continuous acquisition, because the stream carries a large volume of data and is fast, only a small section of the stream is kept in limited memory (Batarseh, Yang & Deng, 2017). The few passes made over the stream are used to attain
approximation results. Streaming processing technology and theory have been studied for years. The streaming processing model is employed for online applications, usually at the millisecond or second level (Bornakke & Due, 2018).

4 Big Data Analytic Tools

There are many big data tools for data evaluation today. However, only two tools will be discussed in this paper: Apache Spark and Apache Hadoop.

5 Apache Hadoop

Apache Hadoop is an open-source data framework, or platform, built in Java and devoted to analysing and storing huge amounts of unstructured data (E. Laxmi Lydia & Srinivasa Rao, 2018). Digital mediums are transmitting large amounts of data; as such, new big data technologies are emerging at a fast rate. Nevertheless, Apache Hadoop was among the first such tools. It enables several simultaneous tasks to execute across one to many servers without delay (Kim & Lee, 2014). It comprises a distributed file system that permits transmission of files and data between different nodes in split seconds. Besides, Apache Hadoop has the ability to keep processing effectively even in cases of node failure (MATACUTA & POPA, 2018).

5.1 Evolvement of Apache Hadoop

Scientists Mike Cafarella and Doug Cutting created the Hadoop 1.0 platform and introduced it in 2006 to support delivery for the Nutch search engine. It was inspired by Google's MapReduce, which divides an application into small sections to execute on different nodes (Mach-Król & Modrzejewska, 2017). The Apache Software Foundation allowed the public to access the tool in 2012. The name of the tool came from the yellow soft toy elephant of Doug Cutting's kid. In the course of its development, a second, improved version,
Hadoop 2.3.0, was launched on 20 February 2014. It contained major adjustments to the architecture.

5.2 Hadoop Ecosystem/Architecture

The Hadoop ecosystem is a framework or platform which assists in addressing the issues of big data. It consists of various components and services for storing, maintaining, ingesting and analysing data (Landset, Khoshgoftaar, Richter & Hasanin, 2015). The structural design can be divided into two parts: the Hadoop core and other, complementary components.

5.3 Components of Apache Hadoop

The figure below illustrates the key components of Apache Hadoop.

Figure 1: Key Components of Apache Hadoop (Source: Landset, Khoshgoftaar, Richter & Hasanin, 2015)

5.3.1 HDFS (Hadoop Distributed File System)
HDFS stores information in small blocks and distributes them across a cluster (Naidu, 2018). Every block is replicated several times to make sure the data stays available. HDFS is the most essential element of the Hadoop ecosystem. It is the basic Hadoop storage system. It is developed in Java and offers fault-tolerant, cost-efficient, scalable and dependable data storage for big data (Mavridis & Karatza, 2017). HDFS runs on commodity hardware and ships with a default configuration suitable for many installations; further configuration is required only for very large data sets. Hadoop uses shell-like commands to communicate directly with HDFS.

Components of HDFS

There are two main Hadoop HDFS components: the DataNode and the NameNode.

NameNode: it is also referred to as the Master node. It does not store the dataset or actual data. Instead, it stores metadata, that is, the location of the data, which DataNode the information is kept on, the number of blocks, which rack, among other details. It keeps track of directories and files. The work of the HDFS NameNode is to manage the namespace of the filesystem, control users' access to documents and execute file system operations such as closing, naming, and opening directories and files (Aung & Thein, 2014).

DataNode: it is also referred to as the Slave. Its work is to store the actual data in HDFS. The DataNode carries out the read and write operations requested by the users. A DataNode block replica comprises two files on the file system: the first file holds the data, and the second records the block's metadata, which includes the data checksums.

In the initial set-up, each DataNode connects to its matching NameNode and begins the interaction. Validation of the DataNode software version and namespace ID happens during this first interaction. If a mismatch is discovered, the DataNode will automatically shut down. The HDFS
DataNode is responsible for performing operations such as block replica deletion, creation and replication according to the NameNode's instructions (Hussain & T, 2016). Another task carried out by the DataNode is to monitor the data storage of the system.

5.3.2 Hadoop MapReduce

Hadoop MapReduce offers data processing. It runs tasks in parallel by distributing the work as small blocks. It is a software framework for easily writing programs that process the huge amounts of unstructured and structured data kept in HDFS (Singh & Reddy, 2014). MapReduce applications are parallel in nature and are therefore very helpful for carrying out analysis of huge amounts of data using several machines in the cluster. As such, it enhances the reliability and speed of the cluster.

Figure 2: Hadoop MapReduce (Source: Greeshma & Pradeepini, 2016)

Hadoop MapReduce works by dividing the processing into two stages: the map phase and the reduce phase (Greeshma & Pradeepini, 2016). Each phase uses key-value pairs as input and output. Besides, the programmer specifies two functions: the map function and the reduce function.
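Before detailing each function, here is a minimal, self-contained Python sketch of the classic word-count job. It is illustrative only: real Hadoop jobs would implement the Mapper and Reducer interfaces of the Java API, and the shuffle step shown here is performed by the framework itself.

# A toy illustration of the map and reduce functions (plain Python):
from collections import defaultdict

def map_fn(document):
    # emit a (key, value) pair for every word in the input split
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # combine all values that share the same key
    return (word, sum(counts))

# simulate the shuffle phase, which groups map output by key
documents = ["big data is big", "data is valuable"]
groups = defaultdict(list)
for doc in documents:
    for word, one in map_fn(doc):
        groups[word].append(one)

results = [reduce_fn(word, counts) for word, counts in groups.items()]
print(results)  # [('big', 2), ('data', 2), ('is', 2), ('valuable', 1)]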
The map function takes a data set and transforms it into another data set, where individual elements are broken down into key/value pairs (tuples). The reduce function, in turn, takes the map output as its input and joins those data tuples by key, adjusting the value for each key appropriately (Lathar & Srinivasa, 2019). The following are some of the features of MapReduce:

Simplicity: MapReduce workloads are easy to execute. Programs can be written in many languages, such as C++, Java and Python (Glybovets & Dmytruk, 2016).
Speed: through parallel processing, MapReduce addresses in minutes or hours problems that would otherwise take more than a day to solve.
Scalability: MapReduce can manage a large amount of data.
Fault tolerance: MapReduce handles failures. If a single copy of the data is unavailable, another device possesses the same copy of the key pair, which can be utilized to do the same work.

5.3.3 Hadoop Common

This is a group of common libraries and utilities which support the other Hadoop modules. It ensures that hardware failures are automatically managed by the Hadoop cluster. The Hadoop Common component is perceived as the core, or base, of the framework, as it offers important services and underlying processes such as abstraction of the underlying operating system and its file systems. In addition, Hadoop Common contains the JAR (Java Archive) files and scripts needed to start Hadoop. Besides, the component offers documentation and source code, as well as a contribution section that incorporates various projects from the Hadoop community.
5.3.4 Hadoop YARN

YARN assigns resources, which permits various users to run applications without worrying about the increased amount of work. YARN is also referred to as the Hadoop operating system, as it is responsible for monitoring and managing workloads (Huang, Meng, Zhang & Zhang, 2017). It enables several data processing engines, such as batch processing and real-time processing, to manage data kept on one platform.

Figure 3: Hadoop YARN Diagram (Source: Huang, Meng, Zhang & Zhang, 2017)

YARN has been identified as the Hadoop 2 data processing system. The major features of YARN include:

Flexibility: enables other data processing models, such as streaming and interactive processing. As such, other programs can also be executed alongside MapReduce programs in Hadoop 2.
Shared: offers dependable, shared, stable and secure operational services across several workloads. Other programming paradigms, such as iterative modelling and graph processing, have also been made possible for data processing.
Efficiency: since multiple programs execute on the same cluster, Hadoop efficiency increases without affecting the quality of service.

5.3.5 Other Hadoop Components

Ambari

This is a web-based interface for provisioning, managing and monitoring big data clusters, covering ecosystem elements such as MapReduce, HCatalog, ZooKeeper, Pig, HDFS, Hive, HBase, Oozie and Sqoop. It offers support for managing the cluster's health and permits evaluating the performance of specific elements, such as Pig, MapReduce and Hive, in a user-friendly manner. Management of Hadoop becomes easier since Ambari offers a secure and consistent platform for operational control.

Figure 4: Ambari (Source: Landset, Khoshgoftaar, Richter & Hasanin, 2015)

The following are some of the features of Ambari:

Centralized security system: Ambari minimizes the complexity of configuring and administering cluster security across the whole platform.
Complete visibility into the health of the cluster: Ambari makes sure that the cluster is available and healthy through a holistic approach to monitoring.
Simplified configuration, installation and management: Ambari efficiently and easily creates and monitors clusters at scale.
Highly customizable and extensible: Ambari is flexible enough to bring custom services under its management.

Cassandra

Cassandra is a highly scalable, open-source distributed NoSQL database system dedicated to managing huge amounts of data across several commodity servers, with the aim of supporting high availability without a single point of failure (Fan, Ramaraju, McKenzie, Golab & Wong, 2015).

Flume

Flume is a reliable, distributed component for efficiently collecting, aggregating and moving large streams of data into HDFS. Flume is a reliable and fault-tolerant tool. This Hadoop ecosystem component permits the flow of data from its origin into the Hadoop ecosystem. It utilizes a simple, extensible data model that accommodates online analytical applications (Ranawade, Navale, Dhamal, Deshpande & Ghuge, 2017). Flume assists in the instant acquisition of data from several servers into Hadoop.
Figure 5: Flume (Source: Ranawade, Navale, Dhamal, Deshpande & Ghuge, 2017)

HBase

HBase is a non-relational, distributed database running on top of Hadoop that stores huge amounts of structured data in tables that can consist of millions of columns and billions of rows. HBase can serve as an input source for MapReduce workloads. It is a distributed, scalable NoSQL database developed on top of HDFS. HBase offers real-time read and write access to information in HDFS (Mavani, 2013).
Figure 6: HBase (Source: Mavani, 2013)

There are two components of HBase: the RegionServer and the HBase Master.

HBase Master: it is not involved in the actual data storage. However, it handles load balancing across RegionServers. It monitors and manages the Hadoop cluster. Besides, it carries out administration (it has an interface for creating, updating and deleting tables). The HBase Master also manages failover and controls DDL operations (Xu & Liang, 2013).
RegionServer: it is the worker node which handles read, write, update and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, executing on the HDFS DataNode.

Solr

Solr is a highly scalable search component which enables centralized configuration, recovery, indexing and failover. Applications developed using Solr are sophisticated and offer high performance (Chen, Xu & Zhu, 2014). Solr assists in finding the needed data within huge amounts of data. Besides, it can also be used for storage. The following are some of the features of Solr:

RESTful APIs: to interact with Solr, Java programming skills are not a must. Rather, one can use RESTful services to interact with it (Vis, 2013). Documents are entered into Solr in file formats such as JSON, XML and CSV, and results are returned in the same formats.
Enterprise ready: depending on the requirements of the company, Solr can be deployed in multiple kinds of systems, small or big, such as distributed, standalone or cloud deployments.
NoSQL database: Solr can distribute search tasks across a cluster (Yan, Liu & Lao, 2014).
Highly scalable: the capacity of Solr can be scaled by adding replicas.
Full text search: Solr offers all the capabilities required for a full text search, such as phrases, wildcards, tokens, spell check and auto-complete.
Extensible and flexible: by extending the Java classes and configuring them appropriately, the Solr components can be customized easily (Cassales, Charão, Pinheiro, Souveyet & Steffenel, 2015).
Admin interface: Solr offers a user-friendly, easy-to-use, feature-rich user interface that assists in carrying out all possible workloads, such as adding, updating, deleting and searching documents and managing logs.
Text-centric and ordered by relevance: Solr is regularly utilized to search text files, and the results are returned ordered by their relevance to the user's query.

Hadoop Sqoop

Sqoop is a tool that transfers large amounts of data between structured databases and Hadoop. It brings data in from external sources into connected components of the Hadoop ecosystem, such as HDFS, Hive or HBase. Likewise, it transfers information from Hadoop out to other external stores (Chen & Jiang, 2015). Sqoop is used with relational databases such as Netezza, MySQL, Teradata and Oracle.
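As a hedged illustration of how such a transfer is typically invoked, a Sqoop import from MySQL into HDFS might look like the following (the host, database, credentials and table name are hypothetical):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

The --num-mappers option controls how many parallel map tasks perform the transfer, which corresponds to the parallel data transfer feature noted below.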
Figure 7: Apache Sqoop (Source: Chen & Jiang, 2015)

Some of the features of Sqoop include:

Imports sequential data sets from the mainframe: Sqoop meets the increasing demand to transfer data from the mainframe to HDFS (Chen, Ko & Yeo, 2015).
Transfers directly to ORC files: enhances lightweight indexing, compression and query performance.
Parallel data transfer: this is important for optimal system usage and faster performance (Lehrack, Duckeck & Ebke, 2014).
Effective data analysis: improves the effectiveness of data evaluation by merging unstructured and structured data in a schema-on-read data lake.
Fast data copies: fast data copies are made from an external source into Hadoop.

Hadoop ZooKeeper

ZooKeeper is an open-source system that coordinates and synchronizes distributed systems. ZooKeeper is used for naming, offering group services, maintaining configuration data and providing distributed synchronization (Okorafor, 2012). It coordinates and manages a huge cluster of devices.

Figure 8: Hadoop ZooKeeper (Source: Okorafor, 2012)

Some of the features of ZooKeeper include:

Fast: ZooKeeper demonstrates high speed in workloads where data reads are more common than writes. The typical read/write ratio is 10:1.
Ordered: ZooKeeper maintains a record of all transactions.

HCatalog

HCatalog is a storage management layer which enables developers to share and access data. It is the table and storage management tier of Hadoop. HCatalog enables various elements of the Hadoop ecosystem, such as Hive, MapReduce and Pig, to easily read and write data from the cluster. It is a core element of Hive that allows users to keep their information in any structure and format. HCatalog supports the CSV, SequenceFile, RCFile, JSON and ORC file formats by default. HCatalog is associated with several benefits, including:

Provides notifications for data availability
Frees the user from the overhead of data storage and format details
Offers visibility for archiving and data-cleaning tools

Hadoop Hive

Hive is a data warehouse infrastructure that performs three major functions: data querying, summarization and analysis (Agarwal, 2018). Hive uses a language referred to as HQL (HiveQL), a query language similar to SQL. HiveQL translates SQL-like queries into MapReduce jobs which run on Hadoop.
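For illustration, here is a minimal HiveQL query against a hypothetical page_views table (the table and column names are invented for this sketch); Hive compiles a statement like this into MapReduce jobs behind the scenes:

-- Count hits per country for one day of hypothetical web-server data
SELECT country, COUNT(*) AS hits
FROM page_views
WHERE view_date = '2019-01-01'
GROUP BY country
ORDER BY hits DESC;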
Figure 9: Apache Hadoop Hive (Source: Agarwal, 2018)

The major parts of Hive are:

Metastore: this is where the metadata is stored.
Query compiler: compiles HiveQL into a DAG (Directed Acyclic Graph).
Driver: controls the lifecycle of a HiveQL statement.
Hive server: offers a Thrift server.

Hadoop Oozie

Hadoop Oozie is a server-based application that schedules and manages Hadoop workloads. It combines several tasks sequentially into a single logical unit of work. The framework is fully integrated with YARN and supports Hadoop jobs for Pig, Sqoop, MapReduce and Hive.

Figure 10: Hadoop Oozie (Source: Agarwal, 2018)

Users can develop DAG workflows, which can execute sequentially and in parallel in Hadoop. Oozie is scalable and handles the timely execution of multiple workflows in a Hadoop cluster. Besides, Oozie is very flexible: users can easily stop, rerun, start and suspend tasks.
Users can also skip a particular node or re-execute it in Oozie. There are two types of Oozie jobs:

Oozie workflow: it runs and stores workflows of Hadoop jobs, such as Pig, MapReduce and Hive jobs.
Oozie coordinator: it executes workflow jobs based on data availability and predefined schedules.

Hadoop Pig

Hadoop Pig is a dedicated high-level tool which is responsible for manipulating the information kept in HDFS (Barskar & Phulre, 2017). It consists of a compiler for MapReduce and a language referred to as Pig Latin. It permits specialists to ETL (extract, transform and load) the information without writing MapReduce code. Hadoop Pig loads the data, applies the needed filters and stores the data in the required format. Pig needs the Java runtime environment to run programs (rna C & Ansari, 2017).

Figure 11: Hadoop Pig (Source: Barskar & Phulre, 2017)

The features of Pig include:
Extensibility: users can develop their own functions for performing special-purpose processing.
Manages all types of data: Pig evaluates both unstructured and structured data.
Optimization opportunities: the system can optimize operations automatically. This helps the user concentrate on semantics rather than efficiency.

Avro

Avro is a component of the Hadoop ecosystem and is well known for data serialization. It is an open-source system that offers data exchange and data serialization services for Hadoop (Plase, Niedrite & Taranovs, 2017). These facilities can be utilized independently or together. With the help of Avro, big data applications written in various languages can exchange data. Data can be organized into messages or files using the serialization service. Avro stores the schema together with the data in a single file or message, allowing programs to easily understand the information kept in an Avro message or file.

Avro schema: Avro depends on schemas for serialization and deserialization. A schema is required by Avro for every data read or write. Avro data is stored together with its schema in a document; as such, documents can be processed at any time by any program.

Dynamic typing: this means serialization and deserialization without code generation. It complements the code generation that Avro offers for statically typed languages as an optional optimization.

The following are some of the features provided by Avro:

It has remote procedure (subroutine) calls
It has a container file to store persistent data
It has a rich data structure
It has a fast, compact, binary data format
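Avro schemas are written in JSON. A minimal sketch of such a schema for a hypothetical User record follows (the record and field names are invented for illustration):

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Because this schema travels with the serialized data, any program in any supported language can later read the file without external type information, which is the interoperability property described above.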
Thrift

Thrift is a software framework for scalable, cross-language service development. It is an interface definition framework used for remote procedure call (RPC) communication. Hadoop conducts many RPC calls, and Apache Thrift can be used for their execution and other operations.

Figure 12: Thrift (Source: Barskar & Phulre, 2017)

Apache Drill

The main goal of Apache Drill is to process large amounts of data, including semi-structured and structured data. It is a low-latency distributed query engine designed to
query huge amounts of data and scale to multiple nodes (Hausenblas & Nadeau, 2013). Drill is the first query engine whose model is schema-free. Drill is a helpful component at Cardlytics, an organization that offers consumer-purchase information for internet and mobile banking. Drill is being used at Cardlytics to quickly execute queries and process huge numbers of records. Drill has a unique memory management system that improves memory usage and allocation and minimizes garbage collection. Drill works with Hive by enabling developers to reuse their existing Hive deployments. Some of the features of Apache Drill include:

Extensibility: Drill offers a scalable design at all layers, including the query optimizer, query layer, and client API. Any layer can be extended for the specific needs of a company.
Dynamic schema discovery: it is not mandatory for Apache Drill to have a type specification or schema for data in order to execute a query. Rather, Drill processes data in units referred to as record batches and discovers the schema during processing.
Flexibility: Drill offers a hierarchical columnar data model that can represent highly flexible and complex data and permit efficient processing.
Decentralized metadata: Drill is different from other SQL technologies for Hadoop in that it lacks a centralized metadata requirement. Users of Drill do not need to create and manage tables in metadata in order to query data.

Apache Mahout

Mahout is an open-source engine providing a library of scalable machine learning and data mining algorithms. After data has been stored in Hadoop HDFS, Mahout offers the data science tools to identify important patterns in those big data sets. Mahout's algorithms include:
Clustering: items in a specific class are organized into naturally occurring groups, such that items in the same group are similar to one another.
Collaborative filtering: it mines user behaviour and makes product recommendations, for instance Amazon recommendations.
Classification: it learns from existing categorizations and then assigns unclassified items to the best category.
Frequent pattern mining: it analyses items in a group, such as terms in a query session or items in a shopping cart, and identifies which items typically appear together.

All the above components of the Hadoop ecosystem strengthen the functionality of Hadoop.

5.4 Hadoop Download

To work in the Hadoop environment, the first step is to download Hadoop. This step can be carried out on any device at no cost, since the framework is available as an open-source tool. However, there are specific system requirements that need to be met to ensure the installation is successful. They include:

Hardware requirements: Hadoop can operate on any common hardware cluster. All that is required is some commodity hardware.
Operating system (OS) requirements: Hadoop can operate on the Windows and Unix platforms. The preferred platform for production deployments is Linux.
Browser requirements: Hadoop supports most of the popular browsers, such as Microsoft Internet Explorer, Google Chrome, Mozilla Firefox, and Safari, on Windows, Linux and Macintosh systems as needed.
Software requirements: because the Hadoop framework is written in Java, the Hadoop software requirement is Java. Java 1.6 is the minimum supported version (CAO, WU, LIU & ZHANG, 2010).
Database requirements: components of the Hadoop ecosystem such as HCatalog and Hive use the MySQL database to run the Hadoop framework efficiently. A newer version can be used, or Apache Ambari's wizard can be allowed to choose what is needed.

5.5 Types of Hadoop Installation

There are different ways of running Hadoop. Below are some of the different scenarios for downloading, installing and running Hadoop.

Standalone mode: although Hadoop is a distributed framework for managing big data, it can be installed on a single node in standalone mode (Ibrahim & Bajwa, 2018). In this case the whole Hadoop framework runs as a single Java process. This is regularly used for debugging, and it is particularly helpful when testing MapReduce programs on one node before executing them on a large Hadoop cluster.

Completely distributed mode: this is a distributed mode that joins many commodity hardware nodes together to create a Hadoop cluster. In such an arrangement, the JobTracker, NameNode and SecondaryNameNode run on the master node, while the TaskTracker and DataNode daemons run on the slave nodes.

Pseudo-distributed mode: this is a single Java system that runs the whole Hadoop cluster. The different daemons, such as the DataNode, JobTracker, NameNode and TaskTracker, execute within a single Java process on one machine to simulate the distributed Hadoop cluster.
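For instance, a pseudo-distributed setup typically points the default file system at a local HDFS daemon in core-site.xml. A minimal sketch follows; the port shown is the common documentation default and may differ per installation:

<!-- core-site.xml: tell Hadoop clients where the (local) HDFS NameNode lives -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>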
5.6 Major Commands of Hadoop

Hadoop provides different file system commands that are used against the Hadoop infrastructure to obtain the needed results. They include:

checksum
moveFromLocal
appendToFile
copyToLocal
chgrp

These are the most popular commands applied in Hadoop to carry out different workloads across the Hadoop platform.

5.7 Hadoop Streaming

Hadoop Streaming is the basic API utilized when handling streaming data. Both the mapper and the reducer receive their input in a standard format: the input is read from stdin, and the output is written to stdout. This is the technique used within Hadoop to handle a continuous data stream and process it in a consistent manner (Gupta, Kumar & Gopal, 2015); a minimal Python example is given at the end of this subsection.

Hadoop is the system used for storing and processing big data. Hadoop development involves computation over big data using different programming languages, such as Scala and Java, among others. Hadoop supports many data types, such as Char, Decimal, Float, Boolean, Array, String and Double, among others. Hadoop delivers an essential capability called Hadoop data analytics. Some of the interesting details behind the development of Big Data Hadoop include:

The HDFS evolved from the Google File System
The MapReduce application was developed to analyse web pages
The HBase evolved from the Google BigTable
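The promised Hadoop Streaming sketch follows: a word-count mapper and reducer in Python that read stdin and write stdout, as described in section 5.7. The input/output paths are hypothetical, and the streaming jar location varies by installation.

#!/usr/bin/env python
# mapper.py: read lines from stdin, emit "word<TAB>1" pairs on stdout
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so counts can be summed per word
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

A job like this is typically launched through the streaming jar, for example:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/books -output /user/hadoop/wordcount \
  -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py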
5.8 Reasons to Choose Apache Hadoop

With big data growing across the globe, the need for Hadoop developers is increasing at a fast pace. Experienced and skilled Hadoop developers with practical implementation expertise are highly needed to add value to current processes. Besides, there are other main reasons for choosing this technology. They include:

Extensive utilization of big data: many organizations are recognizing that, in order to manage the explosion of data, they need to employ a technology that can ingest such data and extract something valuable and meaningful from it. Undoubtedly, Hadoop has dealt with this issue, and organizations are beginning to adopt the technology. Besides, a survey carried out by Tableau states that, among 2,200 clients, around 76 percent of the participants who already use Hadoop hope to use it in new ways.

Security: recently, security has become a major issue in IT infrastructure. As such, organizations are enthusiastically investing in security elements more than anything else (Sirisha & V.D. Kiran, 2018). Apache Sentry, a component of the Hadoop ecosystem, provides authorization for the information kept in a big data cluster (S. & M., 2019).

New technologies are dominating: the big data trend is growing since users are demanding higher speeds and are therefore turning away from traditional data warehouses. Recognizing this customer concern, Hadoop is aggressively incorporating new technologies, such as AtScale, Jethro, Cloudera Impala and Actian Vector, into its primary infrastructure.

The eruption of big data has compelled organizations to employ technologies that can assist them in managing unstructured and complex data in a manner that allows maximum data to be analysed and extracted without any delay or loss. This need led to the development of big
data tools that are capable of processing several tasks at once. The following are some of the features of Hadoop:

Capable of processing and storing large data sets: with increasing amounts of data, data failure and loss become more likely (Manoj Kumar Danthala & Dr. Siddhartha Ghosh, 2015). Nevertheless, Hadoop eases the situation, since it is capable of processing and storing complex and huge unstructured data sets.
Excellent computational capabilities: its distributed computational model ensures that big data is processed fast, with several nodes executing in parallel.
Fewer faults: employing it results in fewer faults. In case of failure in one node, the workloads are automatically redirected to other nodes.
No pre-processing needed: large amounts of data can be stored and retrieved at once, including both unstructured and structured data, without necessarily pre-processing it before keeping it in the database.
Highly scalable: the cluster size can be increased from one machine to multiple servers without extensive administration (Suguna, 2016).
Cost-efficient: it is free of licence costs and thus requires very little money to implement.
5.9 Practical Applications of Apache Hadoop

Some of the organizations which have employed Hadoop include:

Twitter
Yahoo
eBay
Cloudspace
Facebook
LinkedIn
AOL
Alibaba

Both the world and its professionals currently revolve around data analytics. Thus, Hadoop will certainly serve as a support for candidates willing to build a career in big data analytics. Furthermore, it is suitable for ETL developers, software specialists, analytics professionals, and others. Nevertheless, comprehensive knowledge of DBMS, Java and Linux is an added advantage in the analytics domain.

Large demand for competent specialists: based on an article published by Forbes in 2015, about 90 percent of companies across the globe are devoting resources to big data analytics, and approximately one third of companies refer to it as very important. As such, it can be suggested that Big Data Hadoop is not just a technology but a powerful tool for organizations making their appearance in the market. Thus, acquiring knowledge of Hadoop is very crucial for beginners hoping to be analysts in the ten years to come.

More market opportunities: market trends show an upward curve for big data analytics. They demonstrate that the need for data analysts and scientists will constantly increase. This clearly points out that acquiring more knowledge of this technology will guarantee a successful career in any business.

Big bucks: statistics show that the salary of a Hadoop developer in the United States is around $102,000. This clearly shows that learning Hadoop provides an opportunity to grab some of the best-paying jobs in the world of data analytics.

Hadoop has taken the Big Data market by surprise, and organizations are consistently benefiting from its reliability and scalability. Although there are many players in the market, Apache Hadoop has demonstrated constant advancement, making it a better choice for organizations. With the growing number of firms moving towards big data
analytics, learning Hadoop and being versant with its functionality will, without doubt, guide a candidate to new career heights.

5.10 Apache Hadoop Scope

With analytical technologies flooding the current market, Hadoop has become famous and is, without doubt, going to make an even more meaningful impact on firms. The following points confirm this in a more comprehensible way:

i. According to a survey carried out by Markets and Markets, the reliability and efficiency of Hadoop have created a buzz among software giants. Based on the report, this technology grew to $13.9 billion in 2017, 54.9 percent greater than its 2012 market size.

ii. Apache Hadoop is in its blossoming phase, and its growth is going to improve in the short- and long-term future for the following reasons:

Organizations require a distributed database with the ability to store huge amounts of complex and unstructured data, as well as to analyse and process the information to identify important insights
Organizations are willing to devote their resources to this sector. However, it is necessary to invest in a technology that can be upgraded at lower cost and is versatile in various ways.

iii. Marketanalysis.com predicted that the Hadoop market will be powerful in the following areas between 2017 and 2022:

It will have a robust impact in EMEA, America and Asia Pacific
It will possess its own appliances, hardware, and commercially supported software, along with integration, consulting and middleware support.
It will be used across a huge spectrum of areas, such as ETL/data integration, social media and clickstream analysis, the internet of things and mobile devices,
cybersecurity log analysis, predictive/advanced analytics, data mining/visualization, data warehouse offload, and active archives, among others.

6 Apache Spark

Apache Spark is a general-purpose distributed data processing tool appropriate for use in a variety of situations. Libraries for stream processing, graph computation, machine learning and SQL sit on top of the Spark core and can be used together in one application. Some of the programming languages that Apache Spark supports include R, Scala, Python and Java (Hosseini & Kiani, 2018). Data scientists and application developers employ Apache Spark in their applications so as to quickly transform, analyse and query large amounts of data. Some of the activities most associated with Spark include SQL batch and ETL tasks across huge amounts of data, the internet of things (IoT), machine learning activities, financial systems, and processing data streams from sensors.

The earlier version of Apache Spark was similar to MapReduce, the robust distributed processing framework which facilitated Google's indexing of huge amounts of web content. AMPLab developed Apache Spark in 2009, and it started off as an incubated project of the Apache Software Foundation (Ko & Won, 2016).

Some of the advantages of Spark include faster execution through in-memory data caching across several operations running in parallel. Secondly, Apache Spark has the ability to execute multi-threaded activities inside Java virtual machine (JVM) processes. Thirdly, it offers a broader functional programming model and is useful particularly for parallel processing of distributed data with iterative algorithms.
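To make the first advantage concrete, here is a minimal PySpark sketch of in-memory caching. It assumes a local pyspark installation, and the input file name is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

rdd = spark.sparkContext.textFile("events.log")  # hypothetical input file
errors = rdd.filter(lambda line: "ERROR" in line).cache()  # keep in memory

# Both actions below reuse the cached dataset instead of re-reading the file
print(errors.count())
print(errors.take(5))

spark.stop()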
6.1 Apache Spark Ecosystem

The Apache Spark ecosystem is made up of several components; this section describes them. The main components include SparkR, Spark SQL, Spark GraphX, Spark MLlib (machine learning) and Spark Streaming. The ecosystem also contains extensible APIs in various languages, such as R, Java, Python and Scala, which are built on top of the central Spark execution engine (Sherif & Ravindra, 2018). Currently, Apache Spark is one of the most popular tools for big data analysis and has been described by some experts as a next-generation tool; it is utilized by many institutions and organizations, has a big number of contributors, and is quickly gaining popularity as an accepted execution engine for big data (Singh, Anand & B., 2017). The figure below shows the Spark ecosystem.

Figure 13: Apache Spark Ecosystem (Source: Singh, Anand & B., 2017)

Apache Spark is an open-source and very powerful processing engine that is used as a substitute for Apache Hadoop. It has increased the productivity of developers, is easy to use, and is built for high-speed processing (Sherif & Ravindra, 2018). Additionally, it supports graph computation, real-time processing, and machine learning, and it offers in-memory computing abilities for a variety of applications. As mentioned earlier, it also supports APIs for different languages. The next section discusses the various programming languages supported by Apache Spark in more detail.
Scala: Apache Spark is developed in Scala. Scala supports a number of amazing features provided by Spark, most of which are not available from the other programming languages.
Python: Python is employed in Apache Spark to offer great libraries for data analysis. Compared with Scala, Python is much slower.
R language: this language has been integrated into Spark to support statistical analysis and machine learning. In addition, it improves developer productivity. The R language can be used to process data on a single machine.
Java: this is a great language to use with Spark, especially for developers who have been using Java + Hadoop.

6.2 Components of Apache Spark

The Apache Spark ecosystem has several components, a number of which are still under development. Enhancements are made regularly to improve the capabilities of the platform. The following are some of the components empowered by the Spark ecosystem (Borisenko, Turdakov & Kuznetsov, 2014).

6.2.1 Apache Spark Core

Apache Spark Core makes up the Apache Spark kernel and is the basis of distributed and parallel processing (CHEN & SUI, 2018). Spark Core is responsible for all the important input/output operations. Moreover, it handles job scheduling and monitoring over a cluster, efficient memory management, networking, and fault recovery (Aleksiyants, Borisenko, Turdakov, Sher & Kuznetsov, 2015). Apache Spark Core provides high-speed processing because of its in-memory computation capacity. The Resilient Distributed Dataset (RDD) is the special data structure used in Spark Core. Because data is reused and shared, there is a need to store data in intermediate stores. RDD has been employed in Spark
Core to address the slow-down that data sharing and reuse would otherwise cause, by integrating fault-tolerant in-memory computation. RDDs are immutable; therefore, no changes can be made to them. However, one RDD can be converted into another RDD, and this is achieved through the Transformation operation. In essence, this means that existing RDDs can be used to produce new RDDs. Some of the primary qualities of Spark Core include: it supports fault recovery, monitors the cluster's role, is responsible for all primary input/output operations, improves productivity, and is central to Spark programming.

6.2.2 Apache Spark SQL

This is one of the key components of the Spark ecosystem. Spark SQL is used to carry out analysis of structured data in situations where data volumes are large (Kim & Incheol, 2017). This component can be utilized to gather more information about computations and data structures; such information is important for performing extra system optimization and calculating the engine output. Spark SQL does not require any special language to specify the computations. In addition, it facilitates execution of Hive statements on existing Hadoop deployments. One major advantage of Spark SQL is that it simplifies the process of merging and extracting different datasets. Semi-structured and structured data can be accessed from Spark SQL, and it can be used as a distributed SQL query engine. Some of the primary features of Spark SQL include: it is fully compatible with Hive data, offers a standard way to access numerous data sources, and supports analysis of both semi-structured and structured data (Kim & Incheol, 2017).
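A minimal PySpark sketch of Spark SQL in action follows; the data, view name and column names are hypothetical, chosen only to illustrate registering a DataFrame and querying it with SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical structured data: (city, amount)
df = spark.createDataFrame(
    [("Nairobi", 120.0), ("Oslo", 75.5), ("Nairobi", 20.0)],
    ["city", "amount"],
)
df.createOrReplaceTempView("sales")

# Spark SQL runs against the registered view as a distributed SQL query engine
totals = spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city")
totals.show()

spark.stop()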
6.2.3 Apache Spark Streaming

Spark Streaming is one of the lightweight Spark ecosystem components. It enables developers to carry out data streaming and batch processing easily. Spark Streaming uses a continuous stream of input data for real-time data processing (Boachie & Li, 2019). One of the major benefits of Spark Streaming is that it supports fast scheduling. It conducts streaming analytics by consuming data in micro-batches on which transformations are applied. Some of the key features of Spark Streaming include: the possibility of combining historical data and streaming data, exactly-once message guarantees, fast, reliable and easy processing of live data streams, and the ability to include Spark MLlib for machine learning.

Spark Streaming is applied in situations where rapid response and real-time analytics are required, for instance cyber security, diagnostics, IoT sensors, and alarms, among others. It is also essential in online marketing, supply chain management, finance, and campaigns. Spark Streaming operates in three phases, as illustrated in the figure below:

Figure 14: How Apache Spark Streaming operates (Source: Boachie & Li, 2019)

The gathering phase involves identifying the built-in stream sources, which are categorized into two kinds: advanced sources and basic sources. The processing phase involves
applying advanced, feature-rich algorithms to the gathered data. The data storage phase involves moving the processed data to live dashboards, databases, and file systems. Spark Streaming supports a high level of abstraction.

6.2.4 Apache Spark MLlib

Spark MLlib is among the most essential Spark ecosystem components. It provides high-speed, high-quality algorithms and is a scalable machine learning library. Spark MLlib supports APIs in Python, Scala and Java (Borisenko, Pastukhov & Kuznetsov, 2016). It has emerged as an essential part of big data mining systems. It is compatible with different programming languages, scalable, easy to use, and integrates with other tools easily. MLlib has enhanced the development and deployment of scalable pipelines. Spark MLlib supports implementations of various machine learning algorithms, including classification, collaborative filtering, clustering, regression, and decomposition (a minimal sketch appears after the GraphX subsection below).

6.2.5 Apache Spark GraphX

This component is built on top of graph computation to allow users to reason about, transform and build graph data at scale. The component already ships with a library of the popular algorithms. GraphX is an API used for graph-parallel computation. It is also possible to perform classification and clustering through Spark GraphX. This component also enables parallel execution of graphs and graph operations. Apache Spark has a built-in graph computation engine for graphs and graph manipulations.
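The MLlib sketch promised above follows: a minimal PySpark example of the clustering algorithm family, using k-means on hypothetical two-feature data points (assumes pyspark is installed):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical two-feature data points
df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"]
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(features)   # cluster the points into two groups
model.transform(features).show()            # adds a 'prediction' column per row

spark.stop()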
6.2.5 Apache Spark GraphX
This component is built on top of Spark core as a graph computation engine, allowing users to reason about, transform, and build graph data at scale. It ships with a library of popular graph algorithms, and tasks such as classification and clustering can also be performed through it. GraphX is an API for graph-parallel computation, enabling graphs and graph-parallel operations to be executed in parallel.
6.2.6 Apache Spark R
The R language has been integrated into Apache Spark to improve developer productivity for statistical analysis. R is used together with Apache Spark via SparkR, which extends R's single-machine processing to Spark (Vychuzhanin, 2018). SparkR benefits users by letting them combine R with the power of Spark. Among its benefits, it facilitates reading data from different sources such as JSON files and Hive tables, and it receives all the optimizations performed by the engine, such as memory management and code generation.
6.2.7 Scalability Function
Apache Spark scales to multiple machines and cores; it can run on huge amounts of data, typically terabytes, on clusters with many machines simultaneously (Funika & Koperek, 2016). Moreover, tasks that operate on data frames are dispersed across the cluster.
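The scaling behaviour is visible even on a single machine: a DataFrame is split into partitions, and Spark schedules one task per partition. A small illustrative check (actual partition counts vary with the environment):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    df = spark.range(0, 1_000_000)       # a DataFrame of one million rows
    print(df.rdd.getNumPartitions())     # how many partitions (tasks) Spark uses

    # Repartitioning redistributes rows so more tasks can run in parallel
    # across the cores or machines available to the cluster.
    wider = df.repartition(16)
    print(wider.rdd.getNumPartitions())  # 16

    spark.stop()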
6.3 Running Spark Applications on a Cluster
The diagram below illustrates how a Spark application runs on a cluster.

Figure 15: Running Spark Applications on a Cluster

As shown in the figure above, Spark applications execute as independent sets of processes, coordinated by the SparkSession object in the driver program. The cluster or resource manager assigns tasks to workers, one task per partition. Each task applies its unit of work to the dataset in its partition and produces a new dataset for that partition. Iterative algorithms process data repeatedly and therefore benefit from caching datasets across the iterations. Finally, the results are either delivered back to the driver application or saved to a storage medium. Apache Spark supports several cluster or resource managers, including Kubernetes, Apache Hadoop YARN, Apache Mesos, and Spark's own standalone manager. Additionally, Spark has a local mode in which the driver and executors run as threads in a single process on the user's computer rather than on a cluster. This is particularly useful when developing an application on a personal workstation.
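As a sketch of the difference between local mode and a managed cluster, the deployment target is chosen through the master setting (the YARN and standalone lines assume an already-configured cluster and are illustrative):

    from pyspark.sql import SparkSession

    # Local mode: driver and executors run as threads on this machine,
    # using as many worker threads as there are cores ("local[*]").
    spark = (SparkSession.builder
             .appName("deploy-demo")
             .master("local[*]")
             .getOrCreate())

    # On a real cluster, the same application would instead be submitted
    # to a cluster manager, e.g. from the command line:
    #   spark-submit --master yarn --deploy-mode cluster app.py
    #   spark-submit --master spark://host:7077 app.py   (standalone manager)

    print(spark.sparkContext.master)   # confirms the chosen master
    spark.stop()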
6.4 Applications of Apache Spark
Apache Spark can process multiple petabytes of data at a time, distributed across several integrated physical and virtual servers. It also offers a large set of supported languages (Scala, R, Python, and Java), APIs, and developer libraries. This flexibility makes Spark a good option for a wide variety of applications and use cases. Apache Spark is commonly used with popular NoSQL databases such as MongoDB, Apache Cassandra, Apache HBase, and MapR-DB; with distributed data stores such as Amazon S3, Hadoop HDFS, and MapR XD; and with distributed messaging stores such as Apache Kafka and MapR-ES. The typical use cases below illustrate this range; a short data-integration sketch follows the list.
Stream processing: application developers constantly have to keep up with streams of data, including sensor data and log files, arriving simultaneously from different sources. While it is practical to store such data as they arrive and assess them retrospectively, it is often important to process and analyse them as they come in. For instance, streams of data from financial activities should be processed in real time in order to detect and prevent transactions that indicate fraudulent activity.
Machine learning: the accuracy and feasibility of machine learning approaches grow with the volume of data available. A program can be trained to detect patterns and make decisions when triggered, based on well-analysed and well-understood datasets, before the same solution is applied to new, unknown data. Apache Spark is well suited to training machine learning algorithms because of its ability to store data in memory and execute iterative queries quickly. Running the same queries repeatedly, and at scale, significantly reduces the time needed to go through a set of data in order to determine the most suitable algorithm.
Interactive analytics: rather than executing pre-defined queries to build static dashboards for stock prices, production, or sales lines, data scientists and analysts can explore their data with Spark by running a query, viewing the outcome, and then changing or refining the initial query to generate further results. Spark suits such situations because of its ability to adapt and respond quickly.
Data integration: Apache Spark is used in several organizations to extract, clean, and standardize data from different sources, thereby reducing the time and cost required for extract, transform, and load (ETL) processes.
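As a minimal ETL-style sketch of the data-integration use case (the file paths and column names are placeholders invented for illustration): read raw JSON, clean and standardize it, and write the result in a columnar format.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lower, trim

    spark = SparkSession.builder.appName("etl-demo").getOrCreate()

    # Extract: read semi-structured input (the path is a placeholder).
    raw = spark.read.json("/data/raw/events.json")

    # Transform: drop incomplete rows and standardize a text column.
    clean = (raw.dropna(subset=["user_id"])
                .withColumn("country", lower(trim(col("country")))))

    # Load: write the cleaned data as Parquet for downstream analytics.
    clean.write.mode("overwrite").parquet("/data/clean/events.parquet")

    spark.stop()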
6.5 Practical Applications of Apache Spark
Several big data technology vendors have been quick to adopt Apache Spark, having recognized the opportunity it presents to add value to their existing big data products, such as machine learning and interactive querying. Established brands such as Huawei and IBM have invested heavily in the technology. Moreover, several start-up companies are building business solutions that depend partly or wholly on Apache Spark. For instance, in 2013 the Berkeley team behind Apache Spark founded Databricks, which provides a hosted, Spark-driven, end-to-end data platform. The company received more than $245 million in funding between 2013 and 2017, enabling its employees to continue improving and enhancing the open-source code of the Apache Spark project (Wang, Zhao, Liu & Lv, 2018).
Furthermore, major Hadoop vendors such as Hortonworks, Cloudera, and MapR have started to support YARN-based Spark alongside their current products, and each vendor is working hard to add value for its clients. Moreover, Huawei, IBM, and other companies have funded Apache Spark projects to enable them to integrate their own products
while supporting extensions and improvements to the Apache project (Kienzler et al., 2018). Companies that run Apache Spark systems include Tencent (a social networking company), Taobao (an eCommerce company), and Baidu (a Chinese search engine). Tencent, with more than 800 million active users, is reported to generate more than 700 terabytes of data every day, processed on a cluster of over 8,000 compute nodes (Guo, Zhao, Zou, Fang & Peng, 2018). Moreover, the pharmaceutical company Novartis runs systems that depend on Apache Spark to minimize the time needed to deliver data to researchers, while making sure that contractual safeguards and ethical principles are sustained.
6.6 Reasons to Choose Spark
There are several reasons why businesses and organizations should choose Spark. They can be summarized in three points: simplicity, speed, and support.
Simplicity: the capabilities of Apache Spark are accessed through a set of rich APIs that have been designed specifically for easy and quick interaction with large data (Mavridis & Karatza, 2017). The APIs are well structured and documented in a manner that makes it simple for application developers and data scientists to put Spark to work swiftly.
Speed: Apache Spark operates both on disk and in memory and has been designed for speed. Spark performs particularly well when handling interactive queries on data that is stored in memory. As mentioned earlier, data scientists have claimed that Spark can be 100 times faster than Hadoop MapReduce in such situations.
Support: Apache Spark supports a wide variety of programming languages, including Scala, R, Python, and Java. It also supports close integration with several common storage solutions, such as Apache Cassandra, Apache HBase, Apache Hadoop HDFS, and MapR (event store, database, and file system) (Mehrotra & Grade, 2019).
Moreover, the Apache Spark community is large, global, and active, and the platform is well supported by commercial providers such as Huawei, IBM, and Databricks, among others.
To conclude this section, one notable thing about Apache Spark is that it amplifies existing tools rather than creating new solutions from scratch. The components of the Apache Spark ecosystem have made it a popular big data tool compared with other frameworks, because Apache Spark can process various kinds of workloads, for instance graph processing, structured data processing, and real-time analytics, among others (Poojitha & Sowmyarani, 2018). As a result, Apache Spark continues to attract attention and is expected to provide further functionality for processing ad-hoc queries. Apache Spark also replaces MapReduce by offering iterative processing logic and interactive execution of code through the Scala REPL and Python; code can still be compiled using R and Java. Additionally, this section has identified that the Apache Spark ecosystem comprises several components, including SparkR, Spark SQL, GraphX, MLlib, and Spark Streaming. The ecosystem also contains extensible APIs written in various languages, such as R, Java, Python, and Scala, which are built on top of the central Spark execution engine.
Apache Spark is an open-source and very powerful processing engine that is used as a substitute for Apache Hadoop. It increases developer productivity, is easy to use, and is built for high-speed processing. Additionally, it supports graph computation, real-time processing, and machine learning, and it offers in-memory computing capabilities for a variety of applications. Moreover, this section has discussed how Apache Spark scales to multiple machines and cores and runs on huge amounts of data, typically terabytes, on clusters with many machines simultaneously (Karau, Konwinski & Wendell, 2015).
Apache Spark can process multiple petabytes of data at a time, distributed across several integrated physical and virtual servers. In its final parts, this section has described how several vendors of big data technologies are quickly adopting Apache Spark, having recognized the opportunity it presents to add value, such as machine learning and interactive querying, to their existing big data products. Established brands such as Huawei and IBM have invested heavily in the technology. The section has also elaborated on the reasons why a business or company should choose Apache Spark.
7 Comparison between Apache Hadoop and Apache Spark
There are many big data analytical tools on the current market, which makes it challenging to choose the right one. A standard side-by-side comparison of the advantages and disadvantages of each is likely to be ineffective, since organizations should examine each platform from the angle of their specific needs. This section compares the two leading platforms, Apache Spark and Apache Hadoop.
7.1 The Market Situation
Both Apache Spark and Apache Hadoop are open-source frameworks created by the Apache Software Foundation, and both are leading platforms in big data analytics. Hadoop has dominated the big data market for over five years; a recent market survey shows that Hadoop has over 50,000 clients while Spark has more than 10,000 customers. Nevertheless, Spark began gaining popularity in 2012 and overtook Hadoop within a year. The growth in the number of installations in 2016 and 2017 demonstrates that the trend remains consistent, with reported growth rates of 47 percent for Spark and 14 percent for Hadoop.
7.2 The Main Difference between Apache Hadoop and Apache Spark
The main difference between the two platforms lies in the processing model. Hadoop processes data by reading from and writing to disk, while Spark does it in memory (LI & YANG, 2016). The processing speeds therefore differ considerably: Hadoop can be up to 100 times slower than Spark. Besides, the amount of data each can handle varies: Hadoop supports larger data sets than Spark.
The following are some of the tasks that Hadoop is good for:
Linear processing of large data sets: Apache Hadoop permits linear processing of large data sets. It divides a huge volume of data into smaller portions to be handled individually on different data nodes, then automatically collects the results across the nodes to return a single result. In cases where the resulting data set is larger than the available RAM, Hadoop may perform better than Spark.
Economical solutions where no instant outcomes are expected: Hadoop is an excellent choice if processing speed is not vital. For example, if data processing can be carried out overnight, then the most suitable tool would be Apache Hadoop.
The following are some of the tasks Apache Spark is good for:
Fast data processing: Spark carries out its processing in memory, which makes it faster than Hadoop; it is reported to be 10 times faster in data storage and 100 times faster in data processing.
Iterative processing: if the job is to process the same data repeatedly, Spark outperforms Hadoop. Spark's Resilient Distributed Datasets (RDDs) allow multiple map operations in memory, while Hadoop has to write interim results to disk (see the short sketch after this list).
Close to real-time processing: if an organization needs instant insights, then Spark and its in-memory processing are the most suitable choice (Prasad, 2018).
Graph processing: Spark is excellent for the predictable iterative computations that graph processing requires, and it ships with a graph computation API called GraphX.
Machine learning: Spark has a built-in machine learning library, MLlib, whereas Hadoop needs a third party to provide one. MLlib's algorithms work out of the box, with little or no special configuration, installation, or modification, and they execute in memory. If necessary, a Spark expert can still tune and modify them to fit organizational needs.
Joining data sets: because of its speed, Spark can generate all combinations faster. Nevertheless, Hadoop may perform better in cases where very large data sets are involved and a lot of sorting and shuffling is required.
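The iterative advantage referred to above comes from keeping a working data set in memory between passes. A minimal, illustrative PySpark sketch (the data and loop stand in for a real iterative algorithm):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
    sc = spark.sparkContext

    # A data set that an iterative algorithm will scan many times.
    data = sc.parallelize(range(1_000_000)).cache()   # keep it in memory

    total = 0
    for i in range(5):
        # Each pass re-reads the cached RDD from memory; without cache(),
        # Spark would recompute it, and MapReduce would re-read it from disk.
        total += data.map(lambda x: x * i).sum()

    print(total)
    spark.stop()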
7.3 Examples of Practical Applications
Different examples of practical applications can be analysed to evaluate which framework outshines the other. The following are some of them:
Client segmentation: evaluating client behaviour and pinpointing customer segments that show similar patterns of behaviour helps companies understand client preferences and develop a distinctive client experience (Zheng, 2015).
Risk management: predicting various potential scenarios helps managers make appropriate decisions by choosing the non-risky alternatives.
Real-time fraud detection: once the system has been trained on historical data using machine-learning algorithms, it can use what it has learned to analyse and predict, in real time, inconsistencies that may indicate potential fraud.
Industrial big data analysis: this is also about detecting and predicting inconsistencies, but in this case the inconsistencies are connected to machinery failures. A properly configured system gathers information from sensors to identify pre-failure conditions.
In all of the above examples, Spark outperforms Hadoop because of its fast, real-time processing.
8 Conclusion
In conclusion, based on the evidence this paper has presented, Apache Spark is better than Apache Hadoop in several respects, and this paper therefore proposes that Apache Spark should be used. In terms of performance, Apache Spark has higher processing speeds and offers near-real-time analytics. Apache Hadoop, on the other hand, was designed for batch processing; it does not support real-time processing and is much slower than Apache Spark. Secondly, Apache Spark is popular for its ease of use because it comes with user-friendly APIs built for Spark SQL, Python, Java, and Scala, and it provides an interactive mode that gives users and application developers immediate responses to the queries and actions they take. Apache Hadoop has no interactive elements of its own and only supports add-ons such as Pig and Hive. Apache Spark is also compatible with Apache Hadoop and shares all the data sources that Hadoop uses. Because of its better performance, however, Spark remains the preferred option.
In terms of data processing, Apache Spark has better processing speed and power than Apache Hadoop because it performs operations in memory in a single step. Apache Spark supports shared-secret (password) authentication, which is easier to manage than the Kerberos authentication used
by Apache Hadoop. Apache Spark is a distributed, general-purpose data processing tool appropriate for use in a variety of situations. Stream processing, graph computation, machine learning, and SQL libraries sit on top of the Spark core and can be used together in one system. The Apache Spark ecosystem has several components, a number of which are still under development, and enhancements are made regularly to extend the capabilities of the platform. Spark Streaming, for example, is one of the lightweight ecosystem components: it enables developers to carry out data streaming and batch processing with ease, consumes a continuous stream of input data for real-time processing, and offers fast scheduling capacity as one of its major benefits.
9 References
Agarwal, D. (2018). MapReduce: Insight analysis of big data via parallel data processing using Java programming, Hive and Apache Pig. International Journal of Advanced Research in Computer Science, 9(1), 536-540. doi: 10.26483/ijarcs.v9i1.5414
Al Jabri, H., Al-Badi, A., & Ali, O. (2017). Exploring the usage of big data analytical tools in telecommunication industry in Oman. Information Resources Management Journal, 30(1), 1-14. doi: 10.4018/irmj.2017010101
Aleksiyants, A., Borisenko, O., Turdakov, D., Sher, A., & Kuznetsov, S. (2015). Implementing Apache Spark jobs execution and Apache Spark cluster creation for Openstack Sahara. Proceedings of the Institute for System Programming of RAS, 27(5), 35-48. doi: 10.15514/ispras-2015-27(5)-3
Aung, O., & Thein, T. (2014). Enhancing NameNode fault tolerance in Hadoop Distributed File System. International Journal of Computer Applications, 87(12), 41-47. doi: 10.5120/15264-4020
Barskar, A., & Phulre, A. (2017). Opinion mining of Twitter data using Hadoop and Apache Pig. International Journal of Computer Applications, 158(9), 1-6. doi: 10.5120/ijca2017912854
Batarseh, F., Yang, R., & Deng, L. (2017). A comprehensive model for management and validation of federal big data analytical systems. Big Data Analytics, 2(1). doi: 10.1186/s41044-016-0017-x
Bettencourt, L. (2014). The uses of big data in cities. Big Data, 2(1), 12-22. doi: 10.1089/big.2013.0042
Boachie, E., & Li, C. (2019). Big data processing with Apache Spark in university institutions: Spark streaming and machine learning algorithm. International Journal of Continuing Engineering Education and Life-Long Learning, 29(1/2), 5. doi: 10.1504/ijceell.2019.099217
Borisenko, O., Pastukhov, R., & Kuznetsov, S. (2016). Deploying Apache Spark virtual clusters in cloud environments using orchestration technologies. Proceedings of the Institute for System Programming of the RAS, 28(6), 111-120. doi: 10.15514/ispras-2016-28(6)-8
Borisenko, O., Turdakov, D., & Kuznetsov, S. (2014). Automating cluster creation and management for Apache Spark. Proceedings of the Institute for System Programming of RAS, 26(4), 33-44. doi: 10.15514/ispras-2014-26(4)-3
Bornakke, T., & Due, B. (2018). Big-thick blending: A method for mixing analytical insights from big and thick data sources. Big Data & Society, 5(1), 205395171876502. doi: 10.1177/2053951718765026
Bughin, J. (2016). Big data, big bang? Journal of Big Data, 3(1). doi: 10.1186/s40537-015-0014-3
Cao, N., Wu, Z., Liu, H., & Zhang, Q. (2010). Improving downloading performance in Hadoop Distributed File System. Journal of Computer Applications, 30(8), 2060-2065. doi: 10.3724/sp.j.1087.2010.02060
Cassales, G., Charão, A., Pinheiro, M., Souveyet, C., & Steffenel, L. (2015). Context-aware scheduling for Apache Hadoop over pervasive environments. Procedia Computer Science, 52, 202-209. doi: 10.1016/j.procs.2015.05.058
Catlett, C., & Ghani, R. (2015). Big data for social good. Big Data, 3(1), 1-2. doi: 10.1089/big.2015.1530
Cercone, N. (2015). What's the big deal about big data? Big Data and Information Analytics, 1(1), 31-79. doi: 10.3934/bdia.2016.1.31
Chen, C., & Jiang, S. (2015). Research of the big data platform and the traditional data acquisition and transmission based on Sqoop technology. The Open Automation and Control Systems Journal, 7(1), 1174-1180. doi: 10.2174/1874444301507011174
Chen, F., Xu, C., & Zhu, Q. (2014). A design of a sci-tech information retrieval platform based on Apache Solr and web mining. Applied Mechanics and Materials, 530-531, 883-886. doi: 10.4028/www.scientific.net/amm.530-531.883
Chen, L., Ko, J., & Yeo, J. (2015). Analysis of the influence factors of data loading performance using Apache Sqoop. KIPS Transactions on Software and Data Engineering, 4(2), 77-82. doi: 10.3745/ktsde.2015.4.2.77
Chen, M., & Sui, H. (2018). Parallel entity resolution with Apache Spark. DEStech Transactions on Engineering and Technology Research, (ecame). doi: 10.12783/dtetr/ecame2017/18462
Dhar, V. (2014). Why big data = big deal. Big Data, 2(2), 55-56. doi: 10.1089/big.2014.1522
Diesner, J. (2015). Small decisions with big impact on data analytics. Big Data & Society, 2(2), 205395171561718. doi: 10.1177/2053951715617185
Dumbill, E. (2013). Making sense of big data. Big Data, 1(1), 1-2. doi: 10.1089/big.2012.1503
E. Laxmi Lydia, D., & Srinivasa Rao, M. (2018). Applying compression algorithms on Hadoop cluster implementing through Apache Tez and Hadoop MapReduce. International Journal of Engineering & Technology, 7(2.26), 80. doi: 10.14419/ijet.v7i2.26.12539
Fan, H., Ramaraju, A., McKenzie, M., Golab, W., & Wong, B. (2015). Understanding the causes of consistency anomalies in Apache Cassandra. Proceedings of the VLDB Endowment, 8(7), 810-813. doi: 10.14778/2752939.2752949
Funika, W., & Koperek, P. (2016). Scaling evolutionary programming with the use of Apache Spark. Computer Science, 17(1), 69. doi: 10.7494/csci.2016.17.1.69
Glybovets, A., & Dmytruk, Y. (2016). The effectiveness of programming languages in the Apache Hadoop MapReduce framework. Upravlâûŝie Sistemy i Mašiny, (5(265)), 84-92. doi: 10.15407/usim.2016.05.084
Greeshma, L., & Pradeepini, G. (2016). Big data analytics with Apache Hadoop MapReduce framework. Indian Journal of Science and Technology, 9(26). doi: 10.17485/ijst/2016/v9i26/93418
Guo, R., Zhao, Y., Zou, Q., Fang, X., & Peng, S. (2018). Bioinformatics applications on Apache Spark. GigaScience. doi: 10.1093/gigascience/giy098
Gupta, P., Kumar, P., & Gopal, G. (2015). Sentiment analysis on Hadoop with Hadoop Streaming. International Journal of Computer Applications, 121(11), 4-8. doi: 10.5120/21582-4651
Hare, J. (2014). Bring it on, big data: Beyond the hype. Big Data, 2(2), 73-75. doi: 10.1089/big.2014.1520
Hartung, T. (2018). Making big sense from big data. Frontiers in Big Data, 1. doi: 10.3389/fdata.2018.00005
Hausenblas, M., & Nadeau, J. (2013). Apache Drill: Interactive ad-hoc analysis at scale. Big Data, 1(2), 100-104. doi: 10.1089/big.2013.0011
Hoskins, M. (2014). Big data 2.0: Cataclysm or catalyst? Big Data, 2(1), 5-6. doi: 10.1089/big.2014.1519
Hoskins, M. (2014). Common big data challenges and how to overcome them. Big Data, 2(3), 142-143. doi: 10.1089/big.2014.0030
Hosseini, B., & Kiani, K. (2018). A robust distributed big data clustering based on adaptive density partitioning using Apache Spark. Symmetry, 10(8), 342. doi: 10.3390/sym10080342
Huang, W., Meng, L., Zhang, D., & Zhang, W. (2017). In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(1), 3-19. doi: 10.1109/jstars.2016.2547020
Hussain, A., & Roy, A. (2016). The emerging era of big data analytics. Big Data Analytics, 1(1). doi: 10.1186/s41044-016-0004-2
Hussain, G., & T, T. (2016). File systems and Hadoop Distributed File System in big data. IJARCCE, 5(12), 36-40. doi: 10.17148/ijarcce.2016.51207
Ibrahim, M., & Bajwa, I. (2018). Design and application of a multi-variant expert system using Apache Hadoop framework. Sustainability, 10(11), 4280. doi: 10.3390/su10114280
Karau, H., Konwinski, A., & Wendell, P. (2015). Learning Spark. O'Reilly Media.
Kienzler, R., Karim, R., Alla, S., Amirghodsi, S., Rajendran, M., Hall, B., & Mei, S. (2018). Apache Spark 2. Birmingham: Packt Publishing Ltd.
Kim, J., & Incheol, K. (2017). SSQUSAR: A large-scale qualitative spatial reasoner using Apache Spark SQL. KIPS Transactions on Software and Data Engineering, 6(2), 103-116. doi: 10.3745/ktsde.2017.6.2.103
Kim, S., & Lee, I. (2014). Block access token renewal scheme based on secret sharing in Apache Hadoop. Entropy, 16(8), 4185-4198. doi: 10.3390/e16084185
Kirkpatrick, R. (2013). Big data for development. Big Data, 1(1), 3-4. doi: 10.1089/big.2012.1502
Ko, S., & Won, J. (2016). Processing large-scale data with Apache Spark. Korean Journal of Applied Statistics, 29(6), 1077-1094. doi: 10.5351/kjas.2016.29.6.1077
Landset, S., Khoshgoftaar, T., Richter, A., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1). doi: 10.1186/s40537-015-0032-1
Lathar, P., & Srinivasa, K. (2019). A study on the performance and scalability of Apache Flink over Hadoop MapReduce. International Journal of Fog Computing, 2(1), 61-73. doi: 10.4018/ijfc.2019010103
Lehrack, S., Duckeck, G., & Ebke, J. (2014). Evaluation of Apache Hadoop for parallel data analysis with ROOT. Journal of Physics: Conference Series, 513(3), 032054. doi: 10.1088/1742-6596/513/3/032054
Li, Y., & Yang, S. (2016). Integrating Apache Spark and external data sources using Hadoop interfaces. DEStech Transactions on Engineering and Technology Research, (ssme-ist). doi: 10.12783/dtetr/ssme-ist2016/3990
Mach-Król, M., & Modrzejewska, D. (2017). Analytical needs of Polish companies vs. big data. Informatyka Ekonomiczna, 2(44), 82-93. doi: 10.15611/ie.2017.2.07
Manoj Kumar Danthala, & Siddhartha Ghosh. (2015). Bigdata analysis: Streaming Twitter data with Apache Hadoop and visualizing using BigInsights. International Journal of Engineering Research and Technology, 4(05). doi: 10.17577/ijertv4is050643
Matacuta, A., & Popa, C. (2018). Big data analytics: Analysis of features and performance of big data ingestion tools. Informatica Economica, 22(2), 25-34. doi: 10.12948/issn14531305/22.2.2018.03
Mavani, M. (2013). Comparative analysis of Andrew File System and Hadoop Distributed File System. Lecture Notes on Software Engineering, 122-125. doi: 10.7763/lnse.2013.v1.27
Mavridis, I., & Karatza, H. (2017). Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. Journal of Systems and Software, 125, 133-151. doi: 10.1016/j.jss.2016.11.037
Mehrotra, S., & Grade, A. (2019). Apache Spark Quick Start Guide. Birmingham: Packt Publishing Ltd.
Naidu, D. (2018). Big data "what-how-why" and analytical tools for hydroinformatics. International Journal of Advanced Multidisciplinary Scientific Research, 1(4), 37-47. doi: 10.31426/ijamsr.2018.1.4.214
Okorafor, E. (2012). Availability of JobTracker machine in Hadoop/MapReduce ZooKeeper coordinated clusters. Advanced Computing: An International Journal, 3(3), 19-30. doi: 10.5121/acij.2012.3302
Plase, D., Niedrite, L., & Taranovs, R. (2017). A comparison of HDFS compact data formats: Avro versus Parquet. Mokslas - Lietuvos Ateitis, 9(3), 267-276. doi: 10.3846/mla.2017.1033
Poojitha, G., & Sowmyarani, C. (2018). Pipeline for real-time anomaly detection in log data streams using Apache Kafka and Apache Spark. International Journal of Computer Applications, 182(24), 8-13. doi: 10.5120/ijca2018917942
Prasad, K. (2018). Real-time data streaming using Apache Spark on fully configured Hadoop cluster. Journal of Mechanics of Continua and Mathematical Sciences, 13(5). doi: 10.26782/jmcms.2018.12.00013
Ranawade, S., Navale, S., Dhamal, A., Deshpande, K., & Ghuge, C. (2017). Online analytical processing on Hadoop using Apache Kylin. International Journal of Applied Information Systems, 12(2), 1-5. doi: 10.5120/ijais2017451682
rna C, S., & Ansari, Z. (2017). Apache Pig - A data flow framework based on Hadoop MapReduce. International Journal of Engineering Trends and Technology, 50(5), 271-275. doi: 10.14445/22315381/ijett-v50p244
S., R., & M., M. (2019). Approval of data in Hadoop using Apache Sentry. International Journal of Computer Sciences and Engineering, 7(1), 583-586. doi: 10.26438/ijcse/v7i1.583586
Saeed, F. (2018). Towards quantifying psychiatric diagnosis using machine learning algorithms and big fMRI data. Big Data Analytics, 3(1). doi: 10.1186/s41044-018-0033-0
Sherif, A., & Ravindra, A. (2018). Apache Spark Deep Learning Cookbook. Birmingham: Packt Publishing Ltd.
Singh, D., & Reddy, C. (2014). A survey on platforms for big data analytics. Journal of Big Data, 2(1). doi: 10.1186/s40537-014-0008-6
Singh, P., Anand, S., & B., S. (2017). Big data analysis with Apache Spark. International Journal of Computer Applications, 175(5), 6-8. doi: 10.5120/ijca2017915251
Sirisha, N., & Kiran, K. V. D. (2018). Authorization of data in Hadoop using Apache Sentry. International Journal of Engineering & Technology, 7(3.6), 234. doi: 10.14419/ijet.v7i3.6.14978
Suguna, S. (2016). Improvement of Hadoop ecosystem and their pros and cons in big data. International Journal of Engineering and Computer Science. doi: 10.18535/ijecs/v5i5.57
Tromp, E., Pechenizkiy, M., & Gaber, M. (2017). Expressive modeling for trusted big data analytics: Techniques and applications in sentiment analysis. Big Data Analytics, 2(1). doi: 10.1186/s41044-016-0018-9
Vis, F. (2013). A critical reflection on big data: Considering APIs, researchers and tools as data makers. First Monday, 18(10). doi: 10.5210/fm.v18i10.4878
Vychuzhanin, V. (2018). Distributed software complex on the basis of Apache Spark for processing the flow of big data from complex technical systems. Informatics and Mathematical Methods in Simulation, 8(2), 146-155. doi: 10.15276/imms.v8.no2.146
Wang, Z., Zhao, Y., Liu, Y., & Lv, C. (2018). A speculative parallel simulated annealing algorithm based on Apache Spark. Concurrency and Computation: Practice and Experience, 30(14), e4429. doi: 10.1002/cpe.4429
Xu, J., & Liang, J. (2013). Research on a distributed storage application with HBase. Advanced Materials Research, 631-632, 1265-1269. doi: 10.4028/www.scientific.net/amr.631-632.1265
Yan, L., Liu, S., & Lao, D. (2014). Solr index optimization based on MapReduce. Applied Mechanics and Materials, 556-562, 3506-3509. doi: 10.4028/www.scientific.net/amm.556-562.3506
Zeide, E. (2017). The structural consequences of big data-driven education. Big Data, 5(2), 164-172. doi: 10.1089/big.2016.0061
Zheng, Z. (2015). Introduction to big data analytics and the special issue on big data methods and applications. Journal of Management Analytics, 2(4), 281-284. doi: 10.1080/23270012.2015.1116414