Data Analysis of CIS Faculty Using Hadoop and Pig: Assignment

This assignment solution demonstrates the use of Hadoop and Pig for analysing CIS faculty data. The process begins with loading the 'CIS_FacultyList.csv' file into HDFS storage. The solution then uses Pig to create new datasets based on the degree level (Bachelors, Masters, Doctorate), years of teaching experience (less than 5 years or more than 5 years), and the location of the last degree (North America or elsewhere). The SPLIT command is used extensively to categorise and filter the data. Finally, the solution includes commands to copy the resulting datasets from HDFS back to the local file system. The assignment covers essential aspects of data manipulation and processing within a Hadoop environment, including data loading, splitting, and dataset creation using Pig Latin.
1. Upload the dataset ‘CIS_FacultyList.csv’ into HDFS storage on the cluster, to your designated storage
space. If the file is being copied from Windows rather than Linux, the WinSCP software can be used; it
provides a graphical file-transfer client between Windows and the Linux machine, so files can simply be
dragged and dropped between the two operating systems.
>hadoop fs -copyFromLocal CIS_FacultyList.csv
Before copying the csv file to HDFS, it is important to ensure that the file is present in the Linux local file
system. To check whether the file is present, use the command below.
>ls
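After the copyFromLocal command has run, the upload can be verified by listing the contents of the HDFS
home directory (a quick check; the exact HDFS path depends on your designated storage space):
>hadoop fs -ls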
2. Use Pig to create new datasets from the source file that categorise the instructors using the following
criteria: a. The degree level – Bachelors, Masters or Doctorate b. Number of years of teaching – less than
5 years, or more than 5 years c. Whether the last degree was obtained from North America, Europe or
elsewhere. HINT: Consider using the Pig Latin SPLIT (partition), FOREACH and GROUP statement constructs.
>pig
The above command opens the Grunt shell, where all of the Pig Latin statements can be executed.
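By default this starts Pig in MapReduce mode, so relations are read from and written to HDFS. For quick
testing, Pig can also be started in local mode, which reads from the local file system instead of HDFS
(optional; the rest of this solution assumes the default MapReduce mode):
>pig -x local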
> CIS_Faculty = LOAD 'CIS_FacultyList.csv' USING PigStorage(',') AS (ID:int, Name:chararray,
Location:chararray, Grade:chararray, Title:chararray, Join_Date:chararray, LWD:chararray,
Type:chararray, Division:chararray, Reports_To:chararray, Highest:chararray,
Highest_Qualification:chararray, Major:chararray, University:chararray,
All_Qualifications_From_Profile:chararray, Courses_Taught_Term_201510:chararray,
Major_Teaching_Field:chararray, Document_Other_Professional_Certification_Criteria:chararray,
Criteria:chararray);
Note that Pig field names cannot contain spaces, so the column headings from the csv file are rewritten
here with underscores, and the date columns are typed as chararray because string is not a Pig data type.
The above command loads the data. To view the loaded records at any point, use the DUMP command; the
schema of a relation can be checked with DESCRIBE.
> DUMP CIS_Faculty;
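PigStorage does not skip a header row, so if CIS_FacultyList.csv starts with a line of column headings,
that line is loaded as a record as well. Since the heading text cannot be cast to int, its ID field loads
as null, which gives a simple way to drop it (a sketch, assuming the file actually contains such a header row):
> CIS_Faculty = FILTER CIS_Faculty BY ID IS NOT NULL;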
Creating a new dataset that categorises the instructors based on the degree level – Bachelors, Masters or
Doctorate – using SPLIT:
>SPLIT CIS_Faculty into Bachelor_Level if Highest == 'Bachelor', Master_Level if Highest ==
'Masters', Doctorate_Level if (Highest == 'Doctorate' or Highest == 'Ph.D');
>DUMP Bachelor_Level;
>DUMP Master_Level;
>DUMP Doctorate_Level;
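The assignment hint also mentions the GROUP and FOREACH constructs. As an optional check on the split
above, the number of instructors at each degree level can be counted directly from the loaded relation
(a small sketch using the same CIS_Faculty relation):
>Grouped_By_Degree = GROUP CIS_Faculty BY Highest;
>Degree_Counts = FOREACH Grouped_By_Degree GENERATE group, COUNT(CIS_Faculty);
>DUMP Degree_Counts;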
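Creating new datasets that categorise the instructors based on the number of years of teaching. The loaded
schema has no Number_of_Years_of_Teaching field, so one way to derive it is from the Join_Date column with
FOREACH and Pig's built-in date functions (a sketch only; it assumes Join_Date is stored in MM/dd/yyyy
format, so the pattern should be adjusted to match the actual file):
>CIS_Faculty = FOREACH CIS_Faculty GENERATE *,
(int)YearsBetween(CurrentTime(), ToDate(Join_Date, 'MM/dd/yyyy')) AS Number_of_Years_of_Teaching;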
>SPLIT CIS_Faculty into Less_Than_5 if Number_of_Years_of_Teaching < 5, More_Than_5 if
Number_of_Years_of_Teaching > 5;
>DUMP Less_Than_5;
>DUMP More_Than_5;
Creating new datasets that categorise the instructors based on where the last degree was obtained, using
the University field:
>SPLIT CIS_Faculty into NorthAmerica if University == 'North America', Elsewhere if University !=
'North America';
>DUMP NorthAmerica;
>DUMP Elsewhere;
3. Copy the files from HDFS to the local file system storage. After executing the above commands, seven
different datasets have been created in total. The seven outputs can either be placed in one HDFS
directory and that directory copied in a single step, or each dataset can be copied separately using the
commands below.
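Note that DUMP only prints a relation to the console; a relation is not written to HDFS until it is stored
with the STORE command. A sketch of storing each split result under a directory of the same name, as
comma-separated text (the output paths are assumptions and can be changed):
>STORE Bachelor_Level INTO 'Bachelor_Level' USING PigStorage(',');
>STORE Master_Level INTO 'Master_Level' USING PigStorage(',');
>STORE Doctorate_Level INTO 'Doctorate_Level' USING PigStorage(',');
>STORE Less_Than_5 INTO 'Less_Than_5' USING PigStorage(',');
>STORE More_Than_5 INTO 'More_Than_5' USING PigStorage(',');
>STORE NorthAmerica INTO 'NorthAmerica' USING PigStorage(',');
>STORE Elsewhere INTO 'Elsewhere' USING PigStorage(',');
The hadoop fs commands below are then run from the Linux shell after leaving the Grunt shell with quit.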
>hadoop fs -copyToLocal Bachelor_Level ~/Documents
>hadoop fs -copyToLocal Master_Level ~/Documents
>hadoop fs -copyToLocal Doctorate_Level ~/Documents
>hadoop fs -copyToLocal Less_Than_5 ~/Documents
>hadoop fs -copyToLocal More_Than_5 ~/Documents
>hadoop fs -copyToLocal NorthAmerica ~/Documents
>hadoop fs -copyToLocal Elsewhere ~/Documents
In this case the destination directory is ~/Documents, which can be changed as per requirement.
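Once copied, the presence of the datasets in the local file system can be confirmed in the usual way:
>ls ~/Documents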