CSE3BDC Assignment: Cloudera, Hive, Impala, and Data Processing

1. Installing & getting started with Cloudera QuickStart on VMWare
Install VMware for windows
1. Using the https://my.vmware.com/web/vmware/free link for download the VMware. And
we select the VMware Workstation player.
2. Downloaded VMware is installed by double clicking on the downloaded ".exe" file.
System automatically restarts when we install the VMware.
Install Cloudera for VMWare
1. Download the Cloudera quick start from the
"https://www.cloudera.com/downloads/quickstart_vms/5-12.html".
2. Fill in the form and download the "zip" file.
3. Extract the zip file.
4. Open the installed VMware Workstation Player and select "cloudera-quickstart-vm-5.12.0-0-vmware.vmx" to open the virtual machine.
5. Click "Edit virtual machine settings" and allocate 8 GB of RAM and 2 CPU cores. Go to Start --> Run and type "msinfo32.exe" to check the memory and CPU information of the Windows host system.
6. Click Play.
7. Once the VM has started, we will see the screen below.
8. Click Launch Cloudera Express.
9. Log in to Cloudera Manager with the username and password and start all the services.
10. Click on Hue.
11. Log in to Hue with the username and password. The Impala query editor is shown below.
12. After finishing the work in Impala, log out of Hue and Cloudera Manager.
1. Processing an HDFS file on Impala:
The following steps describe how to process a CSV file on Impala:
1. Analyze a new Impala instance.
2. Browse and load HDFS data from local files.
3. Point the Impala table at existing data files.
4. Specify the Impala table.
5. Query the table.
6. Data loading and querying.
6.1 Analyze a new Impala instance:
Determine the techniques for finding your way around the tables and databases of an unexplored Impala instance.
An empty Impala instance contains no tables, but it does contain two databases:
"default", where new tables are created when you do not specify any other database;
"_impala_builtins", a system database that holds all the built-in functions.
$ impala-shell -i localhost --quiet
Starting Impala Shell without Kerberos authentication
Welcome to the Impala shell. Press TAB twice to see a list of available
commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v...
[localhost:21000] > select version();
+-------------------------------------------+
| version()                                 |
+-------------------------------------------+
| impalad version ...                       |
| Built on ...                              |
+-------------------------------------------+
[localhost:21000] > show databases;
+--------------------------+
| name |
+--------------------------+
| _impala_builtins |
| ctas |
| d1 |
| d2 |
| d3 |
| default |
| explain_plans |
| external_table |
| file_formats |
| tpc |
+--------------------------+
[localhost:21000] > select current_database();
+--------------------+
| current_database() |
+--------------------+
| default |
+--------------------+
[localhost:21000] > show tables;
+-------+
| name |
+-------+
| ex_t |
| t1 |
+-------+
[localhost:21000] > show tables in d3;
[localhost:21000] > show tables in tpc;
+------------------------+
| name |
+------------------------+
| city |
| customer |
| customer_address |
| customer_demographics |
| household_demographics |
| item |
| promotion |
| store |
| store2 |
| store_sales |
| ticket_view |
| time_dim |
| tpc_tables |
+------------------------+
[localhost:21000] > show tables in tpc like 'customer*';
+-----------------------+
| name |
+-----------------------+
| customer |
| customer_address |
| customer_demographics |
+-----------------------+
The following example inserts data into a simple table and performs some queries on it:
[localhost:21000] > insert into t1 values (1), (3), (2), (4);
[localhost:21000] > select x from t1 order by x desc;
+---+
| x |
+---+
| 4 |
| 3 |
| 2 |
| 1 |
+---+
[localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1;
+--------+--------+--------+--------+
| min(x) | max(x) | sum(x) | avg(x) |
+--------+--------+--------+--------+
| 1 | 4 | 10 | 2.5 |
+--------+--------+--------+--------+
[localhost:21000] > create table t2 (id int, word string);
[localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5,
'five');
[localhost:21000] > select word from t1 join t2 on (t1.x = t2.id);
+-------+
| word |
+-------+
| one |
| three |
+-------+
6.2 Browse and load HDFS data from local files:
This scenario illustrates how to create some very small tables, suitable for first-time users to experiment with SQL features.
TAB1 and TAB2 are loaded with data from files in HDFS.
A subset of the data is then copied from TAB1 into a third table.
$ whoami
cloudera
$ hdfs dfs -ls /user
Found 3 items
drwxr-xr-x - cloudera cloudera 0 2013-04-22 18:54 /user/cloudera
drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history
drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive
$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2
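The transcript above only creates the HDFS directories. As a minimal sketch of staging the local files themselves (the file name tab1.csv is an assumption; the two sample rows are copied from the TAB1 query results shown in section 6.6):
$ cat > tab1.csv <<EOF
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
EOF
$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
$ hdfs dfs -ls /user/cloudera/sample_data/tab1
The same pattern (a local CSV file plus hdfs dfs -put) loads TAB2's data into /user/cloudera/sample_data/tab2.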
6.3 Point the Impala table at existing data files:
A convenient way to set up data for Impala to access is to use an external table, where the data already exists as a set of HDFS files, and to point the Impala table at the directory containing those files.
$ cd ~/cloudera/datasets
$ ./tpcds-setup.sh
... Downloads and unzips the kit, builds the data and loads it into HDFS ...
$ hdfs dfs -ls /user/hive/tpcds/customer
Found 1 items
-rw-r--r-- 1 cloudera supergroup 13209372 2013-03-22 18:09
/user/hive/tpcds/customer/customer.dat
$ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more
1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javier.Lewis@VFAxlnZEvOx.org|2452508|
2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||Amy.Moses@Ovk9KjHH.com|2452318|
3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE||Latisha.Hamilton@V.com|2452313|
4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||Michael.White@i.org|2452361|
5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||Robert.Moran@Hh.edu|2452469|
...
The following is saved as customer_setup.sql:
--
-- store_sales fact table and surrounding dimension tables only
--
create database tpcds;
use tpcds;
drop table if exists customer;
create external table customer
(
c_customer_sk int,
c_customer_id string,
c_current_cdemo_sk int,
c_current_hdemo_sk int,
c_current_addr_sk int,
c_first_shipto_date_sk int,
c_first_sales_date_sk int,
c_salutation string,
c_first_name string,
c_last_name string,
c_preferred_cust_flag string,
c_birth_day int,
c_birth_month int,
c_birth_year int,
c_birth_country string,
c_login string,
c_email_address string,
c_last_review_date string
)
row format delimited fields terminated by '|'
location '/user/hive/tpcds/customer';
drop table if exists customer_address;
create external table customer_address
(
ca_address_sk int,
ca_address_id string,
ca_street_number string,
ca_street_name string,
ca_street_type string,
ca_suite_number string,
ca_city string,
ca_county string,
ca_state string,
ca_zip string,
ca_country string,
ca_gmt_offset float,
ca_location_type string
)
row format delimited fields terminated by '|'
location '/user/hive/tpcds/customer_address';
The following command is used to run this script:
impala-shell -i localhost -f customer_setup.sql
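Side note: if the tables had been created or the data files changed outside of Impala (for example through Hive), Impala's cached metadata would need to be refreshed before the tables become queryable, e.g.:
[impala-host:21000] > invalidate metadata;
When the tables are created through impala-shell itself, as here, this step is not needed.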
6.4 Specify the Impala table:
Now that the database metadata has been updated, we can confirm that Impala can access the expected tables and examine the attributes of one of them. The tables were created in the database named "default". By prepending the database name we can fully qualify a table name, for example "default.customer".
[impala-host:21000] > show databases
Query finished, fetching results ...
default
Returned 1 row(s) in 0.00s
[impala-host:21000] > show tables
Query finished, fetching results ...
customer
customer_address
Returned 2 row(s) in 0.00s
[impala-host:21000] > describe customer_address
+------------------+--------+---------+
| name | type | comment |
+------------------+--------+---------+
| ca_address_sk | int | |
| ca_address_id | string | |
| ca_street_number | string | |
| ca_street_name | string | |
| ca_street_type | string | |
| ca_suite_number | string | |
| ca_city | string | |
| ca_county | string | |
| ca_state | string | |
| ca_zip | string | |
| ca_country | string | |
| ca_gmt_offset | float | |
| ca_location_type | string | |
+------------------+--------+---------+
Returned 13 row(s) in 0.01s
6.5 Query the table:
We can now query the data contained in the tables. Depending on the configuration, Impala coordinates the query execution across a single node or multiple nodes.
There are several ways to execute queries on Impala:
Passing a set of commands contained in a file:
$ impala-shell -i impala-host -f myquery.sql
Connected to localhost:21000
50000
Returned 1 row(s) in 0.19s
Using the impala-shell command in interactive mode:
$ impala-shell -i impala-host
Connected to localhost:21000
[impala-host:21000] > select count(*) from customer_address;
50000
Returned 1 row(s) in 0.37s
Passing a single command to impala-shell:
$ impala-shell -i impala-host -q 'select count(*) from customer_address'
Connected to localhost:21000
50000
Returned 1 row(s) in 0.29s
6.6 Data loading and querying:
Data loading:
1. Browse the .csv files and obtain a data set from these local files.
2. Create a table to hold the data.
3. Load the data into the created table (a sketch of steps 2 and 3 follows this list).
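As a minimal sketch of steps 2 and 3, the table can be defined as an external table over the HDFS directory populated in section 6.2, so that creating the table and loading the data happen in one step (the column layout is inferred from the TAB1 query results shown below):
[localhost:21000] > create external table tab1
(
id int,
col_1 boolean,
col_2 double,
col_3 timestamp
)
row format delimited fields terminated by ','
location '/user/cloudera/sample_data/tab1';
This mirrors the customer_setup.sql pattern from section 6.3: the external table simply points at the directory that already holds the CSV data.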
Some important queries:
Inserting data:
The query to insert data into the table is given below:
INSERT OVERWRITE TABLE tab3
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
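This INSERT assumes that the destination table tab3 already exists; a sketch of its DDL, with the columns inferred from the result shown below:
[localhost:21000] > create table tab3
(
id int,
col_1 boolean,
col_2 double,
month int,
day int
)
row format delimited fields terminated by ',';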
Query to check the result:
SELECT * FROM tab3;
The result is:
+----+-------+---------+-------+-----+
| id | col_1 | col_2 | month | day |
+----+-------+---------+-------+-----+
| 1 | true | 123.123 | 10 | 24 |
| 2 | false | 1243.5 | 10 | 25 |
+----+-------+---------+-------+-----+
Examining the contents of the tables:
The queries to examine the data in the tables are given below.
SELECT * FROM tab1;
SELECT * FROM tab2;
SELECT * FROM tab2 LIMIT 5;
Results for that queries:
+----+-------+------------+-------------------------------+
| id | col_1 | col_2 | col_3 |
+----+-------+------------+-------------------------------+
| 1 | true | 123.123 | 2012-10-24 08:55:00 |
| 2 | false | 1243.5 | 2012-10-25 13:40:00 |
| 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 |
| 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5 | true | 243.325 | 1953-04-22 09:11:33 |
+----+-------+------------+-------------------------------+
+----+-------+---------------+
| id | col_1 | col_2 |
+----+-------+---------------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
| 60 | false | 243565423.325 |
| 70 | true | 243.325 |
| 80 | false | 243423.325 |
| 90 | true | 243.325 |
+----+-------+---------------+
+----+-------+-----------+
| id | col_1 | col_2 |
+----+-------+-----------+
| 1 | true | 12789.123 |
| 2 | false | 1243.5 |
| 3 | false | 24453.325 |
| 4 | false | 2423.3254 |
| 5 | true | 243.325 |
+----+-------+-----------+
Aggregate and join:
The query for an aggregate and join is:
SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2)
FROM tab2 JOIN tab1 USING (id)
GROUP BY col_1 ORDER BY 1 LIMIT 5;
And the result is:
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325 | 1243.5 |
| true | 12789.123 | 243.325 |
+-------+-----------------+-----------------+
2. Upload a bank dataset into Hive
Create a database in Hive:
Create a table in the database:
Load the CSV file data into the Hive table (a sketch of these three steps follows):
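The original shows these steps as Hue screenshots. As a minimal sketch in HiveQL, assuming the bank CSV uses comma-separated fields and sits at the hypothetical local path /home/cloudera/bank.csv (the database name bankinfo and table name info come from the steps below; the column list is an assumption based on the job and education queries at the end of this section):
CREATE DATABASE IF NOT EXISTS bankinfo;
USE bankinfo;
CREATE TABLE info
(
age INT,
job STRING,
marital STRING,
education STRING,
balance INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/cloudera/bank.csv' INTO TABLE info;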
Processing the HDFS file on Hive
1) To view all the databases in Hive
2) To see the tables in the default database
3) To see the tables in the bankinfo database
4) To describe the info table
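The four steps above appear as Hue screenshots in the original; the corresponding standard HiveQL commands are:
1) SHOW DATABASES;
2) USE default; SHOW TABLES;
3) SHOW TABLES IN bankinfo;
4) DESCRIBE bankinfo.info;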
Hive queries:
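The original presents the two queries below as screenshots. As a sketch against the bankinfo.info table created above (job, education, and balance are assumed column names), they could be written as:
USE bankinfo;
SELECT job, COUNT(*) FROM info GROUP BY job;
SELECT education, AVG(balance) FROM info GROUP BY education;
The reported results of each query follow.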
1) Report the number of clients of each job category
"entrepreneur" 1
"blue-collar" 1
"services" 1
"management" 2
"technician" 3
2) Report the average yearly balance for all people in each education category
"primary" 10.0
"secondary" 286.6666666666667
"tertiary" 1031.3333333333333
"unknown" 1506.0