Big Data Analytics | Task Report

Running head: BIG DATA ANALYTICS
Big Data Analytics
Name of the Student
Name of the University
Author's Note

Task 1
Predictive Analytics
At present, the generation of big data is growing exponentially, covering both
structured and unstructured data from a variety of sources. Predictive analytics is one of
the major data analytics processes carried out over a selected big data set in order to
extract specific patterns from the data and to predict trends from the discovered patterns.
Through predictive analytics it is possible to estimate the probable future outcome of
a specific event, or the probability that a particular state or event will occur in the future.
In this branch of data mining, analysts predict future possibilities or trends of specific
events based on past data (Singh et al. 2019). The prediction is made using multiple
independent variables and their impact on the dependent variable, through techniques such as
decision trees, data clustering, neural networks and different forms of regression modelling.
Predictive analytics comprises a range of statistical methods that are employed to build and
refine the models mentioned above. It allows analysts to deal with both continuous and
discrete changes in the data. For a predictive analysis to be successful, the training dataset
used for the model must be representative of the test dataset. Usually the training data is
collected from past events, while the test data is drawn from the future. Thus, if the
variable to be predicted is not stable over the period considered, the predictions produced by
the models are unlikely to be successful. As an example, a pharmaceutical company can predict
which drugs will be consumed at a higher rate in some areas compared to others, for instance
due to an outbreak of disease.
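As an illustrative sketch only, the following Python snippet trains a decision tree on a small,
hypothetical historical dataset and uses it to forecast demand for a new region. The feature
names, values and the use of scikit-learn are assumptions made for this example, not part of
the original report.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical historical records per region: [cases_reported, average_age, prior_demand]
X = np.array([[120, 45, 300], [15, 60, 80], [200, 38, 450],
              [40, 52, 150], [180, 41, 400], [25, 65, 90]])
# Target: whether demand for the drug rose in the following quarter (1 = yes)
y = np.array([1, 0, 1, 0, 1, 0])

# The training set (past events) should be representative of the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("forecast for a new region:", model.predict([[160, 47, 380]]))

The same pattern applies to the other techniques mentioned above; only the model class changes,
while the split between representative training data and future test data remains the key
requirement.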
NoSQL Databases
While working with big data, the underlying storage usually needs to offer greater
flexibility and scalability on demand. These two requirements are addressed by NoSQL
databases. In the era of big data, NoSQL is gaining popularity because of its ability to
handle large volumes of unstructured data that cannot be stored in a relational database
without object-relational mapping.
Relational databases were not designed with the scalability and flexibility required for
enormous volumes of unstructured data in mind (Davoudian, Chen and Liu 2018). Due to this lack
of scalability, relational databases struggle with the data generated by modern applications,
and they are unable to take full advantage of the faster processing power available today. The
following are the major categories of NoSQL databases; a brief conceptual sketch of the
key-value and document models is given after the list.
Key-value databases: Key-value NoSQL databases maintain a hash table of keys and their
associated values. The database uses an associative array as its fundamental model, in which
every key is associated with exactly one value. Examples of such databases are Amazon S3 and
Riak.
Document-based databases: This kind of database consists of tagged elements that resemble
documents. Instead of structuring data into fixed rows and columns inside tables, the data is
stored in a semi-structured way with a varying schema. This gives document databases more
flexibility in data modelling. An example of a document-based database is CouchDB (Berwind et
al. 2017).
Column-based storage: In column-based NoSQL databases, each storage block holds data from only
one column, and each column is written contiguously to disk or memory. Examples of such
databases are HBase and Cassandra.
Graph-based databases: A graph-based store is considered a network database that mainly relies
on nodes and edges for storing and representing data. In graph-based storage, nodes represent
entities and edges represent the relationships between them.
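A minimal, purely conceptual Python sketch of the key-value and document models follows. The
record names and fields are invented for illustration, and plain in-memory dictionaries stand
in for a real NoSQL engine.

# Key-value model: every key maps to exactly one opaque value
kv_store = {}
kv_store["patient:1001"] = b'{"name": "A. Smith", "age": 54}'

# Document model: records are self-describing and need not share a schema
doc_store = [
    {"_id": 1, "drug": "DrugA", "region": "North", "units": 1200},
    {"_id": 2, "drug": "DrugB", "prescriber": "Dr. Lee"},  # different fields, same store
]

# Documents can be queried without a fixed table structure
high_volume = [d for d in doc_store if d.get("units", 0) > 1000]
print(high_volume)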
Through the use of NoSQL databases it is possible to reduce complexity for the pharmaceutical
company, as it needs to store a wide variety of attributes in order to maintain sufficient
detail, while the absence of object-relational mapping makes NoSQL a less expensive option
compared to relational databases.
Hadoop Ecosystem
The Hadoop ecosystem provides a cost-effective, highly scalable and fault-tolerant platform
for storing and analysing data in different formats coming from different sources. There are
five main components in the Hadoop ecosystem: HDFS, MapReduce, YARN, HCatalog and Spark. HDFS,
the underlying file system, is an open-source clone of the Google File System. HDFS is cheap
as well as fault-tolerant, which is why it is considered the backbone of the whole Hadoop
system. The file system contains a namenode, a single process on the Hadoop cluster that keeps
track of the different HDFS blocks, and multiple datanodes that are responsible for storing
the data.
The next component is the MapReduce framework. It is responsible for executing distributed
computations on the stored data and requires two functions that are written by the analyst.
Both the input and the output of MapReduce are HDFS files. In addition, the Map and Reduce
functions communicate data over the network, for example by writing to HDFS datanodes while
reading the required data from other nodes. Enormous volumes of big data can be analysed using
Map and Reduce functions (Berwind et al. 2017), because MapReduce automatically executes
multiple mapper and reducer tasks across the deployed cluster, as illustrated by the sketch
below.
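The following is a small local Python sketch of the two user-written functions for a word
count, with an in-process sort standing in for the framework's shuffle phase; the sample
documents and function names are assumptions made for illustration.

from itertools import groupby
from operator import itemgetter

# The analyst writes only these two functions; the framework handles
# distribution, shuffling and grouping by key across the cluster.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

documents = ["big data needs big storage", "data drives predictions"]

# Local stand-in for the shuffle/sort phase that runs between map and reduce
mapped = sorted(kv for line in documents for kv in map_fn(line))
result = [out for key, group in groupby(mapped, key=itemgetter(0))
              for out in reduce_fn(key, (count for _, count in group))]
print(result)  # e.g. [('big', 2), ('data', 2), ...]

In a real Hadoop deployment these two functions run as mapper and reducer tasks spread across
the cluster, and the grouping by key is performed by the framework rather than in a single
process.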
YARN is responsible for deciding which job is executed at what time and on which machines. To
do this, YARN keeps track of which systems have idle CPU and memory, without any map or reduce
task running. For each job it then checks where the required data resides, and it schedules
the map and reduce tasks close to that data.
HCatalog works as the metadata database of Hive: it is the repository of all the available
tables and datasets on the cluster, including table names, columns, storage locations and so
on. This metadata repository for the different datasets, also referred to as the Hive catalog,
is itself stored on HDFS. The stored datasets can be in simple text formats such as CSV.
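Assuming a PySpark installation configured against the cluster's Hive metastore, a sketch
along the following lines could list the tables recorded in the catalog; the database and
table names used here ("default", "sales") are hypothetical.

from pyspark.sql import SparkSession

# Requires Spark configured to use the cluster's Hive metastore
spark = (SparkSession.builder
         .appName("catalog-demo")
         .enableHiveSupport()
         .getOrCreate())

# List every table registered in the Hive catalog for the default database
for table in spark.catalog.listTables("default"):
    print(table.name, table.tableType)

# Inspect columns and storage location of a hypothetical table named "sales"
spark.sql("DESCRIBE FORMATTED default.sales").show(truncate=False)

spark.stop()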
In the Hadoop ecosystem, Spark is considered the latest cluster computing framework and is
intended to replace the more complex MapReduce. In MapReduce, the mapper and reducer functions
communicate with each other through dataset files written to HDFS, whereas Spark does not
write intermediate results to files and instead keeps the data in memory, which makes repeated
computations over the same data much faster.
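As a hedged example of this in-memory style, the following PySpark sketch performs the same
word count as the MapReduce example without writing intermediate files to HDFS; the input path
is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-wordcount").getOrCreate()
sc = spark.sparkContext

# Hypothetical HDFS path; intermediate results stay in memory rather than on HDFS
lines = sc.textFile("hdfs:///data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()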