Assignment on Big Data

1. Introduction
Big data generally refers to large, complex data sets that are difficult to process using traditional applications and tools. The volume and complexity of the data always depend on its structure, and many tools and engines are used to store and process such bulk data. One of the major SQL engines is Impala, which is used to store and manage large amounts of data and to process complex queries over it efficiently. The goal of Impala is to let users express complicated queries directly, in familiar and flexible SQL syntax, with processing speed high enough to answer unexpected questions and complex queries. Impala brings a high degree of flexibility to the familiar database and ETL process, and Apache Impala is used to manage the standard data analysis process in big data. Apache Spark is also the fastest-growing analytics platform and can be used to replace many older Hadoop-based frameworks. Spark uses a machine-optimized processing method for constructing data frames, manipulating data in in-memory tables and analysing the data, and it is mainly used to evolve the processing method for restructuring large volumes of data.
2. IMPALA
Impala
Apache Impala is used for processing large amounts of data stored in a computer cluster. It is an open-source, massively parallel processing (MPP) SQL query engine that runs on Apache Hadoop, and it provides lower latency and higher performance on Hadoop than other SQL engines. It is open-source software written in Java and C++. Hadoop software such as MapReduce, Apache Pig and Apache Hive integrates with Impala on Hadoop, using the same data formats and files, resource management frameworks, metadata and security.
Features of Impala
Features of Impala are given below:
Impala is open-source software released under the Apache license.
By using Impala, we can access the data through SQL queries.
We can store the data in storage systems such as Amazon S3, HDFS and Apache HBase.
It supports in-memory data processing and Hadoop security.
From Apache Hive, Impala reuses the metadata, SQL syntax and ODBC driver.
It supports various file formats such as RCFile, SequenceFile, Parquet, LZO and Avro.
Impala integrates with business-intelligence tools such as Zoomdata, Pentaho, Tableau and MicroStrategy.
Reasons for using Impala
Based on daemon processes, Impala implements a distributed architecture that runs on the same machines as the data.
It is not based on MapReduce jobs like Apache Hive.
Impala combines the SQL support and multi-user performance of a traditional analytic database with the flexibility and scalability of Apache Hadoop, by using standard components such as the Metastore, Sentry, HDFS, YARN and HBase.
It avoids the latency that comes with using MapReduce.
Impala uses the same user interface, SQL syntax, metadata and ODBC driver as Apache Hive, providing a unified and familiar platform for real-time or batch-oriented queries.
It can read common file formats such as Avro, Parquet and RCFile.
By using Impala, users can communicate with HBase or HDFS through SQL queries.
Architecture of Impala
Impala is separated from its storage engine and accesses the data directly through a specialized distributed query engine, which avoids latency. In a Hadoop cluster it runs on a number of systems. There are three main components in the Impala architecture, given below:
Impala metadata and metastore.
Impala daemon.
Impala statestore.
3. Data Analysis using IMPALA
Data Analysis Using Impala
Impala is a massively parallel processing (MPP) SQL query engine used for processing large amounts of data stored in a Hadoop cluster. The engine provides very low latency and very high performance compared to other SQL engines. Impala is open-source software written in Java and C++.
Steps for Analyzing Data
Loading/Ingesting the data to be analyzed
Preparing/Cleaning the data
Analyzing the data
Visualizing results/Generating report
Step 1: Loading or ingesting the data to be analyzed
The first step is to store the data in the Hadoop Distributed File System (HDFS); Impala then starts processing this data. If the data cannot be processed as-is, perform an Extract Transform Load (ETL) step to load the data from HDFS into Impala. Load Data statements are used to load the data, and they have some key properties, listed below:
After the data files are loaded, they are moved from HDFS into the Impala data directory.
Every data file has an individual file name, so either give the file name of the data file to be loaded from HDFS, or give a directory name to load all the data files in that directory into Impala. Wildcard patterns are not supported in the HDFS path.
Syntax for the Load Data statement:
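The general form of Impala's Load Data statement is sketched below (the bracketed parts are optional and the placeholder names are illustrative, not values from this assignment):

LOAD DATA INPATH 'hdfs_path_to_file_or_directory' [OVERWRITE]
  INTO TABLE table_name
  [PARTITION (partition_col = value, ...)];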
The Load Data operation does not change the files: each file keeps the same contents and the same name; only its location changes to the destination path. Impala becomes the main consumer of the data, and the files stored in HDFS under the Impala data directory are removed when the Impala table is dropped.
Step 2: Preparing/Cleaning the data
Data cleaning is an important phase in Impala, because a dataset usually contains some errors and outliers. Some datasets are only quantitative rather than qualitative, contain discrepancies in their codes or names, or lack the attributes of interest for the analysis.
Data preparation is therefore an important task: it is used to correct inconsistencies in the dataset, to smooth noisy data and to fill in missing values or missing attributes. A dataset may also contain duplicate records, inconsistent data or random errors, and these kinds of issues are solved in the data preparation or cleaning step. The main advantage of data cleaning is improved data quality. In Impala, noisy data can be found and cleaned manually with this technique, and missing values are handled either by removing the affected rows or by filling in values for them, as in the sketch below.
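As an illustration only, the query below sketches how such cleaning can be pushed into Impala itself; the table and column names (raw_sales, id, amount, region) are assumptions and not part of this assignment's data:

-- drop rows with a missing key, fill missing amounts, normalise codes/names
CREATE TABLE sales_clean AS
SELECT
  id,
  COALESCE(amount, 0) AS amount,   -- fill missing values with a default
  TRIM(region)        AS region    -- remove stray whitespace from codes or names
FROM raw_sales
WHERE id IS NOT NULL;              -- remove rows whose key attribute is missing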
The basic steps for cleaning or preparing the data are:
Editing: used to correct inconsistent, incomplete or illegible data stored in the Impala table before processing.
Cleaning: used to remove inconsistent data; if any faulty logic or queries have been applied, their effects are removed during cleaning. It is also used to remove extreme values from the table.
Questionnaire checking: used to eliminate unwanted questionnaires, for example where a questionnaire is incomplete, differs slightly from the others, or the instructions were not followed.
Coding: used to verify that numeric codes are assigned to answers and to check whether statistical techniques can be applied.
Transcribing: used to transfer the data into a form that people can access easily.
Strategy selection for analysis: choose the appropriate data analysis strategy for further processing.
Statistical adjustments: an important step in data cleaning, used where the analysed data requires the correct scale transformations and weighting.
Step 3: Analyzing the data
Data analysis (DA) is used for inspecting, cleaning, transforming and modelling the data. DA can be classified into two types, confirmatory data analysis (CDA) and exploratory data analysis (EDA). CDA is used to confirm hypotheses once the data has been prepared and cleaned and can be analysed, while EDA is used to find out what messages are contained in the data. With the help of this exploration, additional data cleaning can be performed.
Step 4: Visualizing results/Generating report
Here the results are visualized with the help of Matplotlib in Python. Matplotlib is a multi-platform data visualization library. Its main feature is the ability to work with multiple graphical back ends and multiple operating systems, which makes it flexible enough to produce whatever output format the user needs. Matplotlib is a cross-platform library that follows an "everything to everyone" approach, which is one of its greatest strengths, and it has a large developer base and user base.
Work with Matplotlib:
Importing Matplotlib: by convention NumPy is imported under the short name np and Pandas under the short name pd; Matplotlib uses some standard shorthands of its own, and plt is the interface used below.
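A minimal sketch of these standard imports:

import numpy as np               # NumPy under its usual shorthand np
import pandas as pd              # Pandas under its usual shorthand pd
import matplotlib.pyplot as plt  # plt is the Matplotlib plotting interface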
Setting styles: plt.style is a directive used to select an aesthetic style for the figures; for example, the classic Matplotlib style can be selected.
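A one-line sketch of that choice:

plt.style.use('classic')   # switch the figures to the classic Matplotlib look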
Plotting from a script: if the script uses Matplotlib, the function plt.show() opens one interactive window per currently active figure; it is mainly used to display the figures.
For example:
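A minimal, illustrative script (the sine and cosine data is only an example):

# myplot.py -- run with: python myplot.py
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)   # illustrative data
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()                    # opens a window showing the active figure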
Plotting with the help of the IPython shell: plotting with Matplotlib is very convenient from the IPython shell. This mode is enabled with the %matplotlib magic command.
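A short sketch of such an interactive session (the plotted data is illustrative):

In [1]: %matplotlib
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
In [4]: plt.plot(np.random.randn(100))   # a figure window opens and updates as commands run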
After running these commands, the resulting plot appears in an interactive figure window.
Saving a figure to a file: the figure can be saved by using the savefig command.
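A minimal sketch (the file name and data are illustrative):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x))
fig.savefig('my_figure.png')   # the output format is inferred from the file extension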
4. Conversion of SQL data to HDFS file format using Sqoop:
Install Sqoop from the following link:
https://archanaschangale.wordpress.com/2013/09/14/sqoop-installation/
1. Importing a table into HDFS
Command for import:
$ sqoop import --connect jdbc:mysql://localhost/databasename --username $USER_NAME --password $PASSWORD$ --table tablename --m 1
Execute the Sqoop import. Here, we are using the database 'testDb':
$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student --m 1
2. Import all rows of a table in MySQL, but specific columns of the table
$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student --columns "name" -m 1
3. Import all columns, filter rows using a where clause
$ sqoop import --connect jdbc:mysql://localhost/testDb --username root --password hadoop123 --table student --where "id>1" -m 1 --target-dir /user/hduser/ar
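To check that an import succeeded, the target directory can be listed and its contents inspected. The commands below are a sketch: the path matches example 3 above, and part-m-00000 is the usual name of the output file Sqoop writes.

$ hdfs dfs -ls /user/hduser/ar
$ hdfs dfs -cat /user/hduser/ar/part-m-00000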
5. Installing & getting started with Cloudera QuickStart on VMware
Install VMware for Windows
1. Use the link https://my.vmware.com/web/vmware/free to download VMware, and select the VMware Workstation Player.
2. Install the downloaded VMware by double-clicking on the downloaded ".exe" file. The system restarts automatically when VMware is installed.
Install Cloudera for VMware
1. Download the Cloudera QuickStart VM from
"https://www.cloudera.com/downloads/quickstart_vms/5-12.html".
2. Fill in the form and download the "zip" file.
3. Extract the zip file.
4. Open the installed VMware Workstation Player and select "cloudera-quickstart-vm-5.12.0-0-vmware.vmx" to open the virtual machine.
5. Click Edit settings and allocate "8GB" of RAM and "2 CPU cores". Go to Start --> Run and type "msinfo32.exe" to find information about our Windows system.
6. Click play.
7. If it has started, we will see the screen below.
8. Click Launch Cloudera Express.
9. Log in to Cloudera Manager with the username and password and start all the services.
10. Click on Hue.
11. Log in to Hue using the username and password. The Impala query editor is shown below.
12. Once the work on Impala is done, log out of Hue and Cloudera Manager.
6. Processing an HDFS file in Impala:
The following steps describe how to process a CSV file in Impala:
1. Analyze a new Impala instance.
2. Browse and load HDFS data from local files.
3. Point the Impala table at existing data files.
4. Describe the Impala table.
5. Query the table.
6. Data loading and querying.
6.1 Analyze a new Impala instance:
Determine the techniques for finding your way around the tables and databases of an unexplored Impala instance.
An empty Impala instance contains no tables, but it does contain two databases:
"default", where new tables are created when you do not specify any other database.
"_impala_builtins", a system database that holds all the built-in functions.
$ impala-shell -i localhost --quiet
Starting Impala Shell without Kerberos authentication
Welcome to the Impala shell. Press TAB twice to see a list of available
commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v...
[localhost:21000] > select version();
+-------------------------------------------
| version()
+-------------------------------------------
| impalad version ...
| Built on ...
+-------------------------------------------
[localhost:21000] > show databases;
+--------------------------+
| name |
+--------------------------+
| _impala_builtins |
| ctas |
| d1 |
| d2 |
| d3 |
| default |
| explain_plans |
| external_table |
| file_formats |
| tpc |
+--------------------------+
[localhost:21000] > select current_database();
+--------------------+
| current_database() |
+--------------------+
| default |
+--------------------+
[localhost:21000] > show tables;
+-------+
| name |
+-------+
| ex_t |
| t1 |
+-------+
[localhost:21000] > show tables in d3;
[localhost:21000] > show tables in tpc;
+------------------------+
| name |
+------------------------+
| city |
| customer |
| customer_address |
| customer_demographics |
| household_demographics |
| item |
| promotion |
| store |
| store2 |
| store_sales |
| ticket_view |
| time_dim |
| tpc_tables |
+------------------------+
[localhost:21000] > show tables in tpc like 'customer*';
+-----------------------+
| name |
+-----------------------+
| customer |
| customer_address |
| customer_demographics |
+-----------------------+
The following example creates a simple table and runs a few queries against it:
[localhost:21000] > insert into t1 values (1), (3), (2), (4);
[localhost:21000] > select x from t1 order by x desc;
+---+
| x |
+---+
| 4 |
| 3 |
| 2 |
| 1 |
+---+
[localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1;
+--------+--------+--------+--------+
| min(x) | max(x) | sum(x) | avg(x) |
+--------+--------+--------+--------+
| 1 | 4 | 10 | 2.5 |
+--------+--------+--------+--------+
[localhost:21000] > create table t2 (id int, word string);
[localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5,
'five');
[localhost:21000] > select word from t1 join t2 on (t1.x = t2.id);
+-------+
| word |
+-------+
| one |
| three |
+-------+
6.2 Browse and load HDFS data from local files:
This scenario shows how to create some very small tables, suitable for first-time users experimenting with SQL features.
TAB1 and TAB2 are loaded with data files stored in HDFS.
A subset of the data is copied from TAB1 and TAB2.
$ whoami
cloudera
$ hdfs dfs -ls /user
Found 3 items
drwxr-xr-x - cloudera cloudera 0 2013-04-22 18:54 /user/cloudera
drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history
drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive
$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1
/user/cloudera/sample_data/tab2
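To complete this scenario, the local data files can be copied into those HDFS directories and external tables pointed at them. The sketch below is illustrative: the local file names tab1.csv and tab2.csv are assumptions, and the column layout is inferred from the query results shown later in this report.

$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2

Then, in impala-shell:

CREATE EXTERNAL TABLE tab1
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';

CREATE EXTERNAL TABLE tab2
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';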
6.3 Point the Impala table at existing data files:
A convenient way to set up data for Impala to access is to use an external table, where the data already exists as a set of files in HDFS: just point the Impala table at the directory containing those files.
$ cd ~/cloudera/datasets
$ ./tpcds-setup.sh
... Downloads and unzips the kit, builds the data and loads it into HDFS ...
$ hdfs dfs -ls /user/hive/tpcds/customer
Found 1 items
-rw-r--r-- 1 cloudera supergroup 13209372 2013-03-22 18:09
/user/hive/tpcds/customer/customer.dat
$ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more
1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|
1936|CHILE||Javie
r.Lewis@VFAxlnZEvOx.org|2452508|
2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|
TOGO||Amy.Moses@
Ovk9KjHH.com|2452318|
3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|
18|9|1979|NIUE||
Latisha.Hamilton@V.com|2452313|
4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|
1983|MEXICO||Mic
hael.White@i.org|2452361|
5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|
1956|FIJI||Robert.
Moran@Hh.edu|2452469|
...
The following is saved as customer_setup.sql:
--
-- store_sales fact table and surrounding dimension tables only
--
create database tpcds;
use tpcds;
drop table if exists customer;
create external table customer
(
c_customer_sk int,
c_customer_id string,
c_current_cdemo_sk int,
c_current_hdemo_sk int,
c_current_addr_sk int,
c_first_shipto_date_sk int,
c_first_sales_date_sk int,
c_salutation string,
c_first_name string,
c_last_name string,
c_preferred_cust_flag string,
c_birth_day int,
c_birth_month int,
c_birth_year int,
c_birth_country string,
c_login string,
c_email_address string,
c_last_review_date string
)
row format delimited fields terminated by '|'
location '/user/hive/tpcds/customer';
drop table if exists customer_address;
create external table customer_address
(
ca_address_sk int,
ca_address_id string,
ca_street_number string,
ca_street_name string,
ca_street_type string,
ca_suite_number string,
ca_city string,
ca_county string,
ca_state string,
ca_zip string,
ca_country string,
ca_gmt_offset float,
ca_location_type string
)
row format delimited fields terminated by '|'
location '/user/hive/tpcds/customer_address';
The following command is used to run this script:
impala-shell -i localhost -f customer_setup.sql
6.4 Describe the Impala table:
Now that the database metadata has been updated, we can confirm that Impala can access the expected tables and examine the attributes of one of them. The tables were created in the database named default; by prepending the database name, we can qualify the table name, for example "default.customer".
[impala-host:21000] > show databases
Query finished, fetching results ...
default
Returned 1 row(s) in 0.00s
[impala-host:21000] > show tables
Query finished, fetching results ...
customer
customer_address
Returned 2 row(s) in 0.00s
[impala-host:21000] > describe customer_address
+------------------+--------+---------+
| name | type | comment |
+------------------+--------+---------+
| ca_address_sk | int | |
| ca_address_id | string | |
| ca_street_number | string | |
| ca_street_name | string | |
| ca_street_type | string | |
| ca_suite_number | string | |
| ca_city | string | |
| ca_county | string | |
| ca_state | string | |
| ca_zip | string | |
| ca_country | string | |
| ca_gmt_offset | float | |
| ca_location_type | string | |
+------------------+--------+---------+
Returned 13 row(s) in 0.01
6.5 Query the table:
We can now query the data contained in the tables. Depending on the configuration, Impala coordinates query execution across a single node or multiple nodes.
There are many ways to execute queries on Impala:
Passing a set of commands:
$ impala-shell -i impala-host -f myquery.sql
Connected to localhost:21000
50000
Returned 1 row(s) in 0.19s
Using the impala-shell command in interactive mode:
$ impala-shell -i impala-host
Connected to localhost:21000
[impala-host:21000] > select count(*) from customer_address;
50000
Returned 1 row(s) in 0.37s
Passing a single command to the impala-shell:
$ impala-shell -i impala-host -q 'select count(*) from customer_address'
Connected to localhost:21000
50000
Returned 1 row(s) in 0.29s
6.6 Data loading and querying:
Data loading:
1. Browse the .csv file and get a data set from these local files.
2. Create a table to hold the data.
3. Load the data into the created table (a sketch of steps 2 and 3 is given below).
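A sketch of steps 2 and 3 for the tab3 table queried in the next section (the HDFS path and file name are assumptions):

-- step 2: create a table matching the CSV layout
CREATE TABLE tab3
(
  id INT,
  col_1 BOOLEAN,
  col_2 DOUBLE,
  month INT,
  day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- step 3: load a CSV file already copied into HDFS (illustrative path),
-- or populate the table with the INSERT statement shown below
LOAD DATA INPATH '/user/cloudera/sample_data/tab3/tab3.csv' INTO TABLE tab3;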
Some important queries:
Inserting the data:
The query to insert data into the table is given below:
INSERT OVERWRITE TABLE tab3
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
Query to check the result:
SELECT * FROM tab3;
The result is:
+----+-------+---------+-------+-----+
| id | col_1 | col_2 | month | day |
+----+-------+---------+-------+-----+
| 1 | true | 123.123 | 10 | 24 |
| 2 | false | 1243.5 | 10 | 25 |
+----+-------+---------+-------+-----+
Examining the contents of the tables:
The queries to examine the data in the tables are given below.
SELECT * FROM tab1;
SELECT * FROM tab2;
SELECT * FROM tab2 LIMIT 5;
Results for those queries:
+----+-------+------------+-------------------------------+
| id | col_1 | col_2 | col_3 |
+----+-------+------------+-------------------------------+
| 1 | true | 123.123 | 2012-10-24 08:55:00 |
| 2 | false | 1243.5 | 2012-10-25 13:40:00 |
| 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 |
| 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5 | true | 243.325 | 1953-04-22 09:11:33 |
+----+-------+------------+-------------------------------+
+----+-------+---------------+
| id | col_1 | col_2 |
+----+-------+---------------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
| 60 | false | 243565423.325 |
| 70 | true | 243.325 |
| 80 | false | 243423.325 |
| 90 | true | 243.325 |
+----+-------+---------------+
+----+-------+-----------+
| id | col_1 | col_2 |
+----+-------+-----------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
+----+-------+-----------+
Aggregate and join:
The query for an aggregate and a join is:
SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2)
FROM tab2 JOIN tab1 USING (id)
GROUP BY col_1 ORDER BY 1 LIMIT 5;
And the result is:
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325 | 1243.5 |
| true | 12789.123 | 243.325 |
+-------+-----------------+-----------------+
7. During-Game Tweets Analysis (2000)
1. Extract and present the average number of tweets per 'during-game' minute for the top 10 games (i.e. those most tweeted about during the event). A sketch of a possible query is given after this list.
Rank
1 13
2 7
3 4
4 15
5 7
6 8
7 7
8 13
9 15
10 12
2. Rank the games according to number of distinct users tweeting ‘during-game’ and
present the information for the top 10 games, including the number of distinct users for
each.
Game IDs 11, 14 and 16
3. Find the top 3 teams that played in the most games. Rank their games in order of highest
number of ‘during-game’ tweets (include the frequency in your output).
20, 21, 18
4. Find the top 10 (ordered by number of tweets) games which have the highest ‘during-
game’ tweeting spike in the last 10 minutes of the game.
30, 31, 34, 35, 35, 4, 40, 45, 48, 7
5. As well as the official hashtags, each tweet may be labeled with other hashtags. Restricting the data to 'during-game' tweets, list the top 10 most common non-official hashtags over the whole dataset with their frequencies.
6. Find the top 10 games with the highest ‘during-game’ tweeting spikes that are not within
10 minutes (+/-) of any goal
35, 4, 40, 45, 48, 7, 30, 31, 34
7. Draw the graph of the progress of one of the games (the game you choose should have a
complete set of tweets for the entire duration of the game). It may be useful to summarize the
tweet frequencies in 1-minute intervals.
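As an illustration of how task 1 could be written in Impala SQL, the sketch below assumes hypothetical tables during_game_tweets(game_id, tweet_id), holding only 'during-game' tweets, and games(game_id, duration_minutes); these names are not part of the supplied dataset:

-- average tweets per during-game minute for the 10 most-tweeted-about games
SELECT t.game_id,
       COUNT(t.tweet_id) AS tweet_count,
       COUNT(t.tweet_id) / g.duration_minutes AS avg_tweets_per_minute
FROM during_game_tweets t
JOIN games g ON (t.game_id = g.game_id)
GROUP BY t.game_id, g.duration_minutes
ORDER BY tweet_count DESC
LIMIT 10;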
8. Conclusion
Impala is one of the best open-source tools used in data analysis. It is used to analyse and access data stored on Hadoop data nodes without moving the data, and the data is accessed using SQL queries. Compared to other SQL engines, Impala provides much faster access to data in HDFS. Impala reuses well-developed technologies for analysis, such as the SQL syntax, ODBC driver and metadata from Apache Hive, and it can handle huge amounts of data more efficiently than Hive. Pig is another data analysis tool, but it is not suitable for this particular data set; Pig is more suited to complex data flows, whereas Impala is well suited to real-time ad hoc queries and to executing SQL queries in interactive, exploratory analytics. Impala's performance scales with the number of hosts, network devices and so on.

Impala is mainly used to analyse data and can manage huge amounts of data, up to petabytes. It is one of the best tools for managing and analysing data on Hadoop. It follows a relational model for data analysis, is developed in C++, and uses a schema-based data model. Impala provides JDBC and ODBC APIs for analysis, is a very popular tool and supports many languages, but it does not support triggers. Impala can store and access data in HBase and Amazon S3 without requiring knowledge of Java, and it does not use MapReduce jobs. The analysis with Impala is completed in four steps. First, the data is loaded and made ready for analysis: new data is created or imported for immediate use, or the data is stored in the database; this is called data ingestion and is the first step of the analysis. The second step is preparing and cleaning the data: the new data is prepared, and corrupt records are detected and corrected for the analysis; for example, incorrect records are removed from the record set, and after cleaning the data set should be consistent. The next step is analysing the data: data analysis is a process of inspecting and modelling data and involves several different activities, such as defining data requirements, data collection, data processing, building data products, communication and exploratory data analysis. The final step is data visualization, which covers the approaches used to present the results and allows applications to retrieve and manipulate the data; all of these steps together are used to generate the report. Impala works with popular interactive business-intelligence tools and reduces the overall cost of data analysis. Many open-source tools are available on the market, but Impala has built-in support for all of this processing, so data analysis is done quickly with Impala and results are obtained in near real time.